### Manipulating Peaks de novo results from LC-MS/MS data of ETNP 2017 P2 samples using Python

Starting with:

    Peaks de novo results (.csv) of PTM-optimized database searches > ALC 50%

Goal:

    Files with stripped (no PTMs) peptide lists and
    Columns with #'s of each modification in every sequence
    Column with stripped peptide lengths (# amino acids)
    
### To use for a different file:

#### 1. Change the input file name in *IN 4*
#### 2. Use 'find + replace' (Esc + F) to replace the running # (e.g., 233) for another
#### 3. Update the NAAF factor calculated in *IN 6* into *IN 7*
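One way to cut down on the find-and-replace step is to keep the sample number in a single variable and locate the input file from it. The sketch below is only an illustration: `find_peaks_csv` is a hypothetical helper that is not part of this notebook, and the glob pattern `*_DN50_15ppm.csv` is an assumption based on the one input file name listed for sample 233.

```python
from pathlib import Path


def find_peaks_csv(base_dir, sample_id, pattern="*_DN50_15ppm.csv"):
    """Return the single Peaks de novo CSV found in base_dir/<sample_id>/.

    Hypothetical helper; the default pattern is assumed from the
    sample 233 input file name and may need adjusting for other samples.
    """
    sample_dir = Path(base_dir) / str(sample_id)
    matches = sorted(sample_dir.glob(pattern))
    if len(matches) != 1:
        raise FileNotFoundError(
            f"expected exactly one file matching {pattern} in {sample_dir}, "
            f"found {len(matches)}"
        )
    return matches[0]
```

The returned `Path` can be passed straight to `pd.read_csv`, so only `sample_id` would need to change between runs.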

We don't have technical duplicates here, sadly, unlike the MED4 Pro samples. I exported PeaksDN search results CSVs into my ETNP 2017 git repo:

In [5]:
cd /home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/ETNP-SKQ17/PeaksDN-PTMopt/

/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/ETNP-SKQ17/PeaksDN-PTMopt


In [6]:
ls

[0m[01;34m231[0m/  [01;34m233[0m/  [01;34m243[0m/  [01;34m273[0m/  [01;34m278[0m/  [01;34m378[0m/


In [7]:
ls 233

ETNP_SKQ17_DENOVO_162_233-265m-0.3-JA4_DN50_15ppm.csv
ETNP_SKQ17_DENOVO_162_233-265m-0.3-JA4_DN50_15ppm.xml


In [8]:
# LIBRARIES
import os
os.getcwd()
# pandas for working with tabular data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import kde
# regular expressions (regex)
import re
# check pandas version
pd.__version__

'1.0.5'

## 233: 265 m McLane pump filtered on 0.3 um GF-75

In [9]:
# read the CSV into a dataframe using the pandas read_csv function
peaks233_50 = pd.read_csv("/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/ETNP-SKQ17/PeaksDN-PTMopt/233/ETNP_SKQ17_DENOVO_162_233-265m-0.3-JA4_DN50_15ppm.csv")

# also make a dataframe that keeps only peptides with ALC >= 80%
peaks233_80 = peaks233_50.loc[peaks233_50['ALC (%)'] >= 80].copy()

# how many de novo sequence candidates >50% ALC?
print("# redundant Peaks peptides >50% ALC in dataframe", len(peaks233_50))
# how many de novo sequence candidates with ALC >= 80%?
print("# redundant Peaks peptides >80% ALC in dataframe", len(peaks233_80))

#look at the dataframe
peaks233_50.head()

# redundant Peaks peptides >50% ALC in dataframe 4235
# redundant Peaks peptides >80% ALC in dataframe 730


Unnamed: 0,Fraction,Scan,Source File,Peptide,Tag Length,ALC (%),length,m/z,z,RT,Area,Mass,ppm,PTM,local confidence (%),tag (>=0%),mode
0,2,14553,20170410__ETNP-233-265m-0.3um-JA4_01.raw,EN(+.98)LAALEK,8,98,8,444.7376,2,49.92,8900000.0,887.46,0.7,Deamidation (NQ),97 99 99 99 99 99 100 97,EN(+.98)LAALEK,CID
1,2,26516,20170410__ETNP-233-265m-0.3um-JA4_01.raw,VGC(+57.02)DEGLFEELPR,13,98,13,760.857,2,81.3,9410000.0,1519.6978,1.1,Carbamidomethylation,92 95 100 100 100 98 99 99 99 100 100 99 99,VGC(+57.02)DEGLFEELPR,CID
2,2,22453,20170410__ETNP-233-265m-0.3um-JA4_01.raw,WSVVFK,6,98,6,383.2187,2,72.23,322000.0,764.4221,1.1,,96 98 98 99 99 99,WSVVFK,CID
3,2,19513,20170410__ETNP-233-265m-0.3um-JA4_01.raw,FDLLVNK,7,97,7,424.748,2,64.92,1520000.0,847.4803,1.3,,94 97 100 99 98 97 98,FDLLVNK,CID
4,2,13937,20170410__ETNP-233-265m-0.3um-JA4_01.raw,EPLGPVVR,8,97,8,433.7584,2,47.83,4020000.0,865.5021,0.2,,97 98 98 96 99 97 99 95,EPLGPVVR,CID
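The ALC cutoff used above can be wrapped in a small function so the same filter works for any threshold. This is a minimal sketch on made-up rows; `filter_by_alc` is a hypothetical name, and the toy ALC values are not from the ETNP data.

```python
import pandas as pd


def filter_by_alc(df, min_alc, alc_col="ALC (%)"):
    """Return a copy of the rows whose ALC score is at least min_alc."""
    return df.loc[df[alc_col] >= min_alc].copy()


# Toy table with the same column names as the Peaks export (values are illustrative)
toy = pd.DataFrame({
    "Peptide": ["WSVVFK", "FDLLVNK", "AAAK"],
    "ALC (%)": [98, 80, 55],
})

high_conf = filter_by_alc(toy, 80)
print(len(toy), len(high_conf))
```

Because the comparison is `>=`, a peptide scoring exactly 80 is kept, which matches the filter in the cell above.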


In [10]:
# use a count function to enumerate the # of A's (alanines) in each peptide
peaks233_50['A'] = peaks233_50['Peptide'].str.count("A")

# use a count function to enumerate the # of C's (cysteines) in each peptide
peaks233_50['C'] = peaks233_50['Peptide'].str.count("C")

# use a count function to enumerate the # of D's (aspartic acids) in each peptide
peaks233_50['D'] = peaks233_50['Peptide'].str.count("D")

# use a count function to enumerate the # of E's (glutamic acids) in each peptide
peaks233_50['E'] = peaks233_50['Peptide'].str.count("E")

# use a count function to enumerate the # of F's (phenylalanines) in each peptide
peaks233_50['F'] = peaks233_50['Peptide'].str.count("F")

# use a count function to enumerate the # of G's (glycines) in each peptide
peaks233_50['G'] = peaks233_50['Peptide'].str.count("G")

# use a count function to enumerate the # of H's (histidines) in each peptide
peaks233_50['H'] = peaks233_50['Peptide'].str.count("H")

# use a count function to enumerate the # of I's (isoleucines) in each peptide
# Peaks de novo output contains no I's: isoleucine and leucine are isobaric, so both are reported as L
peaks233_50['I'] = peaks233_50['Peptide'].str.count("I")

# use a count function to enumerate the # of K's (lysines) in each peptide
peaks233_50['K'] = peaks233_50['Peptide'].str.count("K")

# use a count function to enumerate the # of L's (leucines) in each peptide
# also these include the isoleucines
peaks233_50['L'] = peaks233_50['Peptide'].str.count("L")

# use a count function to enumerate the # of M's (methionines) in each peptide
peaks233_50['M'] = peaks233_50['Peptide'].str.count("M")

# use a count function to enumerate the # of N's (asparagines) in each peptide
peaks233_50['N'] = peaks233_50['Peptide'].str.count("N")

# use a count function to enumerate the # of P's (prolines) in each peptide
peaks233_50['P'] = peaks233_50['Peptide'].str.count("P")

# use a count function to enumerate the # of Q's (glutamines) in each peptide
peaks233_50['Q'] = peaks233_50['Peptide'].str.count("Q")

# use a count function to enumerate the # of R's (arginines) in each peptide
peaks233_50['R'] = peaks233_50['Peptide'].str.count("R")

# use a count function to enumerate the # of S's (serines) in each peptide
peaks233_50['S'] = peaks233_50['Peptide'].str.count("S")

# use a count function to enumerate the # of T's (threonines) in each peptide
peaks233_50['T'] = peaks233_50['Peptide'].str.count("T")

# use a count function to enumerate the # of V's (valines) in each peptide
peaks233_50['V'] = peaks233_50['Peptide'].str.count("V")

# use a count function to enumerate the # of W's (tryptophans) in each peptide
peaks233_50['W'] = peaks233_50['Peptide'].str.count("W")

# use a count function to enumerate the # of Y's (tyrosines) in each peptide
peaks233_50['Y'] = peaks233_50['Peptide'].str.count("Y")

# use a count function to enumerate the # of carbamidomethylated C's in each peptide
peaks233_50['c-carb'] = peaks233_50['Peptide'].str.count("57.02")

# use a count function to enumerate the # of oxidized M's in each peptide
peaks233_50['m-oxid'] = peaks233_50['Peptide'].apply(lambda x: x.count('M(+15.99)'))

# use a lambda function to enumerate the # of deamidated N's in each peptide
peaks233_50['n-deam'] = peaks233_50['Peptide'].apply(lambda x: x.count('N(+.98)'))

# use a count function to enumerate the # of deamidated Q's in each peptide
peaks233_50['q-deam'] = peaks233_50['Peptide'].apply(lambda x: x.count('Q(+.98)'))

# use a count function to enumerate the # of hydroxylated K's in each peptide
peaks233_50['k-hydr'] = peaks233_50['Peptide'].apply(lambda x: x.count('K(+15.99)'))

# use a count function to enumerate the # of hydroxylated P's in each peptide
# no more P hydroxylation in final searches
# peaks233_50['p-hydr'] = peaks233_50['Peptide'].apply(lambda x: x.count('P(+15.99)'))

# use a count function to enumerate the # of methylated R's in each peptide
peaks233_50['r-meth'] = peaks233_50['Peptide'].apply(lambda x: x.count('R(+14.02)'))

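# create a column with 'stripped' peptide sequences by removing each parenthesized PTM mass
# the pattern is non-greedy so that residues between two PTMs are kept
peaks233_50['stripped peptide'] = peaks233_50['Peptide'].str.replace(r"\(.*?\)", "", regex=True)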

# add a column with the stripped peptide length (number of AAs)
peaks233_50['stripped length'] = peaks233_50['stripped peptide'].apply(len)

# total the number of modifications in sequence
peaks233_50['ptm-total'] = peaks233_50['c-carb'] + peaks233_50['m-oxid'] + peaks233_50['n-deam'] \
+ peaks233_50['q-deam'] + peaks233_50['k-hydr'] + peaks233_50['r-meth']

# calculate NAAF numerator for each peptide k
peaks233_50['NAAF num.'] = peaks233_50['Area'] / peaks233_50['stripped length']

# write modified dataframe to new csv file
peaks233_50.to_csv("/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/ETNP-SKQ17/PeaksDN-PTMopt/233/ETNP-SKQ17-233-265m-0.3-JA4_PTMopt_15ppm_DN50.csv")


# check out the results
peaks233_50.head()

Unnamed: 0,Fraction,Scan,Source File,Peptide,Tag Length,ALC (%),length,m/z,z,RT,...,c-carb,m-oxid,n-deam,q-deam,k-hydr,r-meth,stripped peptide,stripped length,ptm-total,NAAF num.
0,2,14553,20170410__ETNP-233-265m-0.3um-JA4_01.raw,EN(+.98)LAALEK,8,98,8,444.7376,2,49.92,...,0,0,1,0,0,0,ENLAALEK,8,1,1112500.0
1,2,26516,20170410__ETNP-233-265m-0.3um-JA4_01.raw,VGC(+57.02)DEGLFEELPR,13,98,13,760.857,2,81.3,...,1,0,0,0,0,0,VGCDEGLFEELPR,13,1,723846.2
2,2,22453,20170410__ETNP-233-265m-0.3um-JA4_01.raw,WSVVFK,6,98,6,383.2187,2,72.23,...,0,0,0,0,0,0,WSVVFK,6,0,53666.67
3,2,19513,20170410__ETNP-233-265m-0.3um-JA4_01.raw,FDLLVNK,7,97,7,424.748,2,64.92,...,0,0,0,0,0,0,FDLLVNK,7,0,217142.9
4,2,13937,20170410__ETNP-233-265m-0.3um-JA4_01.raw,EPLGPVVR,8,97,8,433.7584,2,47.83,...,0,0,0,0,0,0,EPLGPVVR,8,0,502500.0
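Stripping the mass annotations depends on how the regular expression treats peptides with more than one PTM. A greedy pattern such as `\(.*\)` matches from the first `(` to the last `)` and so also removes the residues in between, whereas `\([^)]*\)` removes each annotation on its own. The sketch below shows the difference; `strip_ptms` is a hypothetical helper and the two-PTM peptide is made up for illustration.

```python
import re

# Matches one parenthesized annotation such as (+57.02) or (+.98)
PTM_ANNOTATION = re.compile(r"\([^)]*\)")


def strip_ptms(peptide):
    """Remove every parenthesized mass annotation from a Peaks peptide string."""
    return PTM_ANNOTATION.sub("", peptide)


single = "EN(+.98)LAALEK"             # first peptide in the table above
double = "VGC(+57.02)DEGLM(+15.99)R"  # made-up peptide carrying two PTMs

print(strip_ptms(single))             # ENLAALEK
print(strip_ptms(double))             # VGCDEGLMR
print(re.sub(r"\(.*\)", "", double))  # VGCR (greedy match also drops DEGLM)
```

Any stripped length, and therefore any NAAF value, computed from a greedy match would be too short for multiply modified peptides.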


In [11]:
# made a new dataframe that contains the sums of certain columns in the modified
# peptide dataframe above (for >50% ALC)

index = ['sample total']

data = {'A': peaks233_50['A'].sum(),
        'C': peaks233_50['C'].sum(),
        'D': peaks233_50['D'].sum(),
        'E': peaks233_50['E'].sum(),
        'F': peaks233_50['F'].sum(),
        'G': peaks233_50['G'].sum(),
        'H': peaks233_50['H'].sum(),
        'I': peaks233_50['I'].sum(),
        'K': peaks233_50['K'].sum(),
        'L': peaks233_50['L'].sum(),
        'M': peaks233_50['M'].sum(),
        'N': peaks233_50['N'].sum(),
        'P': peaks233_50['P'].sum(),
        'Q': peaks233_50['Q'].sum(),
        'R': peaks233_50['R'].sum(),
        'S': peaks233_50['S'].sum(),
        'T': peaks233_50['T'].sum(),
        'V': peaks233_50['V'].sum(),
        'W': peaks233_50['W'].sum(),
        'Y': peaks233_50['Y'].sum(),
        'c-carb': peaks233_50['c-carb'].sum(),
        'm-oxid': peaks233_50['m-oxid'].sum(),
        'n-deam': peaks233_50['n-deam'].sum(),
        'q-deam': peaks233_50['q-deam'].sum(),
        'k-hydr': peaks233_50['k-hydr'].sum(),
        'r-meth': peaks233_50['r-meth'].sum(),
        'Total area': peaks233_50['Area'].sum(),
        'Total length': peaks233_50['stripped length'].sum()
       }

totalpeaks233_50 = pd.DataFrame(data, columns=['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L',\
                                               'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y', 'c-carb', \
                                               'm-oxid', 'n-deam', 'q-deam', 'k-hydr', 'r-meth', \
                                               'Total area', 'Total length'], index=index)

# calculate percentage of C's with carb (should be 1.0)
totalpeaks233_50['% C w/ carb'] = totalpeaks233_50['c-carb'] / totalpeaks233_50['C'] 

# calculate percentage of M's that are oxidized
totalpeaks233_50['% M w/ oxid'] = totalpeaks233_50['m-oxid'] / totalpeaks233_50['M'] 

# calculate percentage of N's that are deamidated
totalpeaks233_50['% N w/ deam'] = totalpeaks233_50['n-deam'] / totalpeaks233_50['N'] 

# calculate percentage of Q's that are deamidated
totalpeaks233_50['% Q w/ deam'] = totalpeaks233_50['q-deam'] / totalpeaks233_50['Q'] 

# calculate percentage of K's that are hydroxylated
totalpeaks233_50['% K w/ hydr'] = totalpeaks233_50['k-hydr'] / totalpeaks233_50['K'] 

# calculate percentage of P's that are hydroxylated (P hydroxylation was dropped from the final searches)
#totalpeaks233_50['% P w/ hydr'] = totalpeaks233_50['p-hydr'] / totalpeaks233_50['P']

# calculate percentage of R's that are methylated
totalpeaks233_50['% R w/ meth'] = totalpeaks233_50['r-meth'] / totalpeaks233_50['R'] 

# calculate NAAF denominator for all peptides in dataset i
totalpeaks233_50['NAAF denom.'] = totalpeaks233_50['Total area'] / totalpeaks233_50['Total length']

# write modified dataframe to new csv file
totalpeaks233_50.to_csv("/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/ETNP-SKQ17/PeaksDN-PTMopt/233/ETNP-SKQ17-233-265m-0.3-JA4_PTMopt_15ppm_DN50_totals.csv")

totalpeaks233_50.head()

Unnamed: 0,A,C,D,E,F,G,H,I,K,L,...,r-meth,Total area,Total length,% C w/ carb,% M w/ oxid,% N w/ deam,% Q w/ deam,% K w/ hydr,% R w/ meth,NAAF denom.
sample total,3021,591,1546,2121,1356,1648,1019,0,4038,4376,...,1085,9568026000.0,35570,1.0,0.402795,0.196872,0.020505,0.165676,0.341517,268991.466947
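The totals table can also be built by summing all columns at once rather than listing each one. The following is a minimal sketch on a toy table (the numbers are invented, not ETNP values) that reproduces the two kinds of derived columns used above: a fraction of modified residues and the NAAF denominator.

```python
import pandas as pd

# Toy per-peptide table (hypothetical values)
toy = pd.DataFrame({
    "M": [1, 2, 0],
    "m-oxid": [1, 0, 0],
    "Area": [100.0, 300.0, 200.0],
    "stripped length": [5, 10, 5],
})

# Sum every column, then turn the resulting Series into a one-row dataframe
totals = toy.sum().to_frame(name="sample total").T
totals = totals.rename(columns={"Area": "Total area",
                                "stripped length": "Total length"})

# Fraction of M's that are oxidized (1 of 3 in this toy table)
totals["% M w/ oxid"] = totals["m-oxid"] / totals["M"]

# NAAF denominator: summed area divided by summed stripped length (600 / 20)
totals["NAAF denom."] = totals["Total area"] / totals["Total length"]

print(totals)
```

Note that the "%" columns hold fractions between 0 and 1, as in the notebook's own totals table.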


In [13]:
# use the calculated NAAF denominator (in the totalpeaks233_50 dataframe, above) to calculate the NAAF
# NAAF: normalized area abundance factor

# don't have to worry here about DECOY hits messing with Area totals
# but we will with Comet results

NAAF50 = 268991.466947

# divide each peptide's NAAF numerator by the >50% ALC denominator to get its NAAF
peaks233_50['NAAF factor'] = (peaks233_50['NAAF num.'])/NAAF50

# make a dataframe that contains only what we need: sequences, AAs, PTMs
peaks233_AA50 = peaks233_50[['stripped peptide', 'NAAF factor', 'A', 'C', 'D', 'E', 'F', 'G', 'H', \
                             'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y', 'c-carb', \
                             'm-oxid', 'n-deam', 'q-deam', 'k-hydr', 'r-meth']].copy()

# multiply each AA count by the peptide's NAAF to normalize its abundance by peak area and peptide length

peaks233_AA50['A-NAAF50'] = peaks233_AA50['A'] * peaks233_50['NAAF factor']
peaks233_AA50['C-NAAF50'] = peaks233_AA50['C'] * peaks233_50['NAAF factor']
peaks233_AA50['D-NAAF50'] = peaks233_AA50['D'] * peaks233_50['NAAF factor']
peaks233_AA50['E-NAAF50'] = peaks233_AA50['E'] * peaks233_50['NAAF factor']
peaks233_AA50['F-NAAF50'] = peaks233_AA50['F'] * peaks233_50['NAAF factor']
peaks233_AA50['G-NAAF50'] = peaks233_AA50['G'] * peaks233_50['NAAF factor']
peaks233_AA50['H-NAAF50'] = peaks233_AA50['H'] * peaks233_50['NAAF factor']
peaks233_AA50['I-NAAF50'] = peaks233_AA50['I'] * peaks233_50['NAAF factor']
peaks233_AA50['K-NAAF50'] = peaks233_AA50['K'] * peaks233_50['NAAF factor']
peaks233_AA50['L-NAAF50'] = peaks233_AA50['L'] * peaks233_50['NAAF factor']
peaks233_AA50['M-NAAF50'] = peaks233_AA50['M'] * peaks233_50['NAAF factor']
peaks233_AA50['N-NAAF50'] = peaks233_AA50['N'] * peaks233_50['NAAF factor']
peaks233_AA50['P-NAAF50'] = peaks233_AA50['P'] * peaks233_50['NAAF factor']
peaks233_AA50['Q-NAAF50'] = peaks233_AA50['Q'] * peaks233_50['NAAF factor']
peaks233_AA50['R-NAAF50'] = peaks233_AA50['R'] * peaks233_50['NAAF factor']
peaks233_AA50['S-NAAF50'] = peaks233_AA50['S'] * peaks233_50['NAAF factor']
peaks233_AA50['T-NAAF50'] = peaks233_AA50['T'] * peaks233_50['NAAF factor']
peaks233_AA50['V-NAAF50'] = peaks233_AA50['V'] * peaks233_50['NAAF factor']
peaks233_AA50['W-NAAF50'] = peaks233_AA50['W'] * peaks233_50['NAAF factor']
peaks233_AA50['Y-NAAF50'] = peaks233_AA50['Y'] * peaks233_50['NAAF factor']

# multiply each PTM count by the peptide's NAAF to normalize its abundance by peak area and peptide length

peaks233_AA50['ccarb-NAAF50'] = peaks233_AA50['c-carb'] * peaks233_AA50['NAAF factor']
peaks233_AA50['moxid-NAAF50'] = peaks233_AA50['m-oxid'] * peaks233_AA50['NAAF factor']
peaks233_AA50['ndeam-NAAF50'] = peaks233_AA50['n-deam'] * peaks233_AA50['NAAF factor']
peaks233_AA50['qdeam-NAAF50'] = peaks233_AA50['q-deam'] * peaks233_AA50['NAAF factor']
peaks233_AA50['khydr-NAAF50'] = peaks233_AA50['k-hydr'] * peaks233_AA50['NAAF factor']
peaks233_AA50['rmeth-NAAF50'] = peaks233_AA50['r-meth'] * peaks233_AA50['NAAF factor']

# write the dataframe to a new csv
peaks233_AA50.to_csv("/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/ETNP-SKQ17/PeaksDN-PTMopt/233/ETNP-SKQ17-233-265m-0.3-JA4_PTMopt_15ppm_DN50_NAAF.csv")

peaks233_AA50.head()

Unnamed: 0,stripped peptide,NAAF factor,A,C,D,E,F,G,H,I,...,T-NAAF50,V-NAAF50,W-NAAF50,Y-NAAF50,ccarb-NAAF50,moxid-NAAF50,ndeam-NAAF50,qdeam-NAAF50,khydr-NAAF50,rmeth-NAAF50
0,ENLAALEK,4.135819,2,0,0,2,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,4.135819,0.0,0.0,0.0
1,VGCDEGLFEELPR,2.690963,0,1,1,3,1,2,0,0,...,0.0,2.690963,0.0,0.0,2.690963,0.0,0.0,0.0,0.0,0.0
2,WSVVFK,0.199511,0,0,0,0,1,0,0,0,...,0.0,0.399021,0.199511,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,FDLLVNK,0.807248,0,0,1,0,1,0,0,0,...,0.0,0.807248,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,EPLGPVVR,1.868089,0,0,0,1,0,1,0,0,...,0.0,3.736178,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
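As a worked check of the arithmetic, the sketch below recomputes the NAAF for two peptides from the tables above (ENLAALEK and WSVVFK), using their reported areas and the >50% ALC denominator of 268991.466947. Only the small `toy` dataframe is new; it is not part of the notebook's pipeline.

```python
import pandas as pd

# Areas, stripped lengths, and V counts for two peptides from the sample 233 table
toy = pd.DataFrame({
    "stripped peptide": ["ENLAALEK", "WSVVFK"],
    "Area": [8900000.0, 322000.0],
    "stripped length": [8, 6],
    "V": [0, 2],
})

# NAAF numerator: peak area per amino acid
toy["NAAF num."] = toy["Area"] / toy["stripped length"]

# NAAF denominator reported in the >50% ALC totals table
naaf_denom = 268991.466947

# NAAF, then the NAAF-weighted valine count
toy["NAAF factor"] = toy["NAAF num."] / naaf_denom
toy["V-NAAF50"] = toy["V"] * toy["NAAF factor"]

print(toy.round(6))
```

The results agree with the table above to the displayed precision: about 4.135819 for ENLAALEK, and about 0.199511 for WSVVFK with a V-NAAF50 of about 0.399021.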


In [14]:
# make a new dataframe that contains the sums of the NAAF-normalized AAs for the sample 233 PeaksDN results
# it also contains the sums of the NAAF-corrected PTM occurrences for each affected residue

index = ['sample total']

data = {'NAAF': peaks233_AA50['NAAF factor'].sum(),
        'A-NAAF': peaks233_AA50['A-NAAF50'].sum(),
        'C-NAAF': peaks233_AA50['C-NAAF50'].sum(),
        'D-NAAF': peaks233_AA50['D-NAAF50'].sum(),
        'E-NAAF': peaks233_AA50['E-NAAF50'].sum(),
        'F-NAAF': peaks233_AA50['F-NAAF50'].sum(),
        'G-NAAF': peaks233_AA50['G-NAAF50'].sum(),
        'H-NAAF': peaks233_AA50['H-NAAF50'].sum(),
        'I-NAAF': peaks233_AA50['I-NAAF50'].sum(),
        'K-NAAF': peaks233_AA50['K-NAAF50'].sum(),
        'L-NAAF': peaks233_AA50['L-NAAF50'].sum(),
        'M-NAAF': peaks233_AA50['M-NAAF50'].sum(),
        'N-NAAF': peaks233_AA50['N-NAAF50'].sum(),
        'P-NAAF': peaks233_AA50['P-NAAF50'].sum(),
        'Q-NAAF': peaks233_AA50['Q-NAAF50'].sum(),
        'R-NAAF': peaks233_AA50['R-NAAF50'].sum(),
        'S-NAAF': peaks233_AA50['S-NAAF50'].sum(),
        'T-NAAF': peaks233_AA50['T-NAAF50'].sum(),
        'V-NAAF': peaks233_AA50['V-NAAF50'].sum(),
        'W-NAAF': peaks233_AA50['W-NAAF50'].sum(),
        'Y-NAAF': peaks233_AA50['Y-NAAF50'].sum(),
        'C-carb-NAAF': peaks233_AA50['ccarb-NAAF50'].sum(),
        'M-oxid-NAAF': peaks233_AA50['moxid-NAAF50'].sum(),
        'N-deam-NAAF': peaks233_AA50['ndeam-NAAF50'].sum(),
        'Q-deam-NAAF': peaks233_AA50['qdeam-NAAF50'].sum(),
        'K-hydr-NAAF': peaks233_AA50['khydr-NAAF50'].sum(),
        'R-meth-NAAF': peaks233_AA50['rmeth-NAAF50'].sum()
       }

totalpeaks233_AA50 = pd.DataFrame(data, columns=['NAAF', 'A-NAAF', 'C-NAAF', 'D-NAAF', 'E-NAAF', 'F-NAAF', \
                                                   'G-NAAF', 'H-NAAF', 'I-NAAF', 'K-NAAF', 'L-NAAF', 'M-NAAF', \
                                                   'N-NAAF', 'P-NAAF', 'Q-NAAF', 'R-NAAF', 'S-NAAF', \
                                                   'T-NAAF', 'V-NAAF', 'W-NAAF', 'Y-NAAF', 'C-carb-NAAF', \
                                                   'M-oxid-NAAF', 'N-deam-NAAF', 'Q-deam-NAAF', 'K-hydr-NAAF',\
                                                   'R-meth-NAAF'], index=index)

# calculate the NAAF-corrected % modified C, M, N, Q, K, and R


totalpeaks233_AA50['% C w/ carb. NAAF'] = totalpeaks233_AA50['C-carb-NAAF'] / totalpeaks233_AA50['C-NAAF']
totalpeaks233_AA50['% M w/ oxid. NAAF'] = totalpeaks233_AA50['M-oxid-NAAF'] / totalpeaks233_AA50['M-NAAF']
totalpeaks233_AA50['% N w/ deam. NAAF'] = totalpeaks233_AA50['N-deam-NAAF'] / totalpeaks233_AA50['N-NAAF']
totalpeaks233_AA50['% Q w/ deam. NAAF'] = totalpeaks233_AA50['Q-deam-NAAF'] / totalpeaks233_AA50['Q-NAAF']
totalpeaks233_AA50['% K w/ hydr. NAAF'] = totalpeaks233_AA50['K-hydr-NAAF'] / totalpeaks233_AA50['K-NAAF']
totalpeaks233_AA50['% R w/ meth. NAAF'] = totalpeaks233_AA50['R-meth-NAAF'] / totalpeaks233_AA50['R-NAAF']

# calculate NAAF summed numerator over denominator (in above cell) for all peptides in dataset i: a check
totalpeaks233_AA50['NAAF check'] = totalpeaks233_AA50['NAAF'] / 268991.466947

# write modified dataframe to new csv file, same name + totals
totalpeaks233_AA50.to_csv("/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/ETNP-SKQ17/PeaksDN-PTMopt/233/ETNP-SKQ17-233-265m-0.3-JA4_PTMopt_DN50_NAAF_totals.csv")

totalpeaks233_AA50.head()

Unnamed: 0,NAAF,A-NAAF,C-NAAF,D-NAAF,E-NAAF,F-NAAF,G-NAAF,H-NAAF,I-NAAF,K-NAAF,...,Q-deam-NAAF,K-hydr-NAAF,R-meth-NAAF,% C w/ carb. NAAF,% M w/ oxid. NAAF,% N w/ deam. NAAF,% Q w/ deam. NAAF,% K w/ hydr. NAAF,% R w/ meth. NAAF,NAAF check
sample total,4469.010808,3847.40697,672.870558,656.326796,1651.830591,532.272591,1281.508363,421.829531,0.0,1708.434088,...,112.51498,484.359101,1111.94236,1.0,0.295306,0.463614,0.143355,0.283511,0.249789,0.016614
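Written out, the quantities computed in the cells above are as follows. The notation is my own shorthand and is not taken from the Peaks documentation: $A_k$ is the peak area of peptide $k$, $L_k$ its stripped length, $n_{X,k}$ the number of residues $X$ in peptide $k$, and $n^{\mathrm{mod}}_{X,k}$ the number of those residues carrying the modification of interest. Sums run over all peptides in the sample at the chosen ALC cutoff.

```latex
% NAAF of peptide k: area per residue, divided by the sample-wide area per residue
\mathrm{NAAF}_k = \frac{A_k / L_k}{\sum_j A_j \,\big/\, \sum_j L_j}

% NAAF-weighted abundance of residue X in the sample (e.g. the A-NAAF total)
X_{\mathrm{NAAF}} = \sum_k n_{X,k}\,\mathrm{NAAF}_k

% NAAF-corrected fraction of residue X that is modified (e.g. "% M w/ oxid. NAAF")
f_X = \frac{\sum_k n^{\mathrm{mod}}_{X,k}\,\mathrm{NAAF}_k}{\sum_k n_{X,k}\,\mathrm{NAAF}_k}
```

Because the sample-wide denominator appears in both the numerator and denominator of $f_X$, it cancels there; it only affects the absolute NAAF totals.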


### Same process, but for de novo peptides with ALC >= 80%:

In [15]:
# use a count function to enumerate the # of A's (alanines) in each peptide
peaks233_80['A'] = peaks233_80['Peptide'].str.count("A")

# use a count function to enumerate the # of C's (cysteines) in each peptide
peaks233_80['C'] = peaks233_80['Peptide'].str.count("C")

# use a count function to enumerate the # of D's (aspartic acids) in each peptide
peaks233_80['D'] = peaks233_80['Peptide'].str.count("D")

# use a count function to enumerate the # of E's (glutamic acids) in each peptide
peaks233_80['E'] = peaks233_80['Peptide'].str.count("E")

# use a count function to enumerate the # of F's (phenylalanines) in each peptide
peaks233_80['F'] = peaks233_80['Peptide'].str.count("F")

# use a count function to enumerate the # of G's (glycines) in each peptide
peaks233_80['G'] = peaks233_80['Peptide'].str.count("G")

# use a count function to enumerate the # of H's (histidines) in each peptide
peaks233_80['H'] = peaks233_80['Peptide'].str.count("H")

# use a count function to enumerate the # of I's (isoleucines) in each peptide
# Peaks de novo output contains no I's: isoleucine and leucine are isobaric, so both are reported as L
peaks233_80['I'] = peaks233_80['Peptide'].str.count("I")

# use a count function to enumerate the # of K's (lysines) in each peptide
peaks233_80['K'] = peaks233_80['Peptide'].str.count("K")

# use a count function to enumerate the # of L's (leucines) in each peptide
# also these include the isoleucines
peaks233_80['L'] = peaks233_80['Peptide'].str.count("L")

# use a count function to enumerate the # of M's (methionines) in each peptide
peaks233_80['M'] = peaks233_80['Peptide'].str.count("M")

# use a count function to enumerate the # of N's (asparagines) in each peptide
peaks233_80['N'] = peaks233_80['Peptide'].str.count("N")

# use a count function to enumerate the # of P's (prolines) in each peptide
peaks233_80['P'] = peaks233_80['Peptide'].str.count("P")

# use a count function to enumerate the # of Q's (glutamines) in each peptide
peaks233_80['Q'] = peaks233_80['Peptide'].str.count("Q")

# use a count function to enumerate the # of R's (arginines) in each peptide
peaks233_80['R'] = peaks233_80['Peptide'].str.count("R")

# use a count function to enumerate the # of S's (serines) in each peptide
peaks233_80['S'] = peaks233_80['Peptide'].str.count("S")

# use a count function to enumerate the # of T's (threonines) in each peptide
peaks233_80['T'] = peaks233_80['Peptide'].str.count("T")

# use a count function to enumerate the # of V's (valines) in each peptide
peaks233_80['V'] = peaks233_80['Peptide'].str.count("V")

# use a count function to enumerate the # of W's (tryptophans) in each peptide
peaks233_80['W'] = peaks233_80['Peptide'].str.count("W")

# use a count function to enumerate the # of Y's (tyrosines) in each peptide
peaks233_80['Y'] = peaks233_80['Peptide'].str.count("Y")

# use a count function to enumerate the # of carbamidomethylated C's in each peptide
peaks233_80['c-carb'] = peaks233_80['Peptide'].str.count("57.02")

# use a count function to enumerate the # of oxidized M's in each peptide
peaks233_80['m-oxid'] = peaks233_80['Peptide'].apply(lambda x: x.count('M(+15.99)'))

# use a lambda function to enumerate the # of deamidated N's in each peptide
peaks233_80['n-deam'] = peaks233_80['Peptide'].apply(lambda x: x.count('N(+.98)'))

# use a count function to enumerate the # of deamidated Q's in each peptide
peaks233_80['q-deam'] = peaks233_80['Peptide'].apply(lambda x: x.count('Q(+.98)'))

# use a count function to enumerate the # of hydroxylated K's in each peptide
peaks233_80['k-hydr'] = peaks233_80['Peptide'].apply(lambda x: x.count('K(+15.99)'))

# use a count function to enumerate the # of hydroxylated P's in each peptide
#peaks233_80['p-hydr'] = peaks233_80['Peptide'].apply(lambda x: x.count('P(+15.99)'))

# use a count function to enumerate the # of methylated R's in each peptide
peaks233_80['r-meth'] = peaks233_80['Peptide'].apply(lambda x: x.count('R(+14.02)'))

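# create a column with 'stripped' peptide sequences by removing each parenthesized PTM mass
# the pattern is non-greedy so that residues between two PTMs are kept
peaks233_80['stripped peptide'] = peaks233_80['Peptide'].str.replace(r"\(.*?\)", "", regex=True)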

# add a column with the stripped peptide length (number of AAs)
peaks233_80['stripped length'] = peaks233_80['stripped peptide'].apply(len)

# total the number of modifications in sequence
peaks233_80['ptm-total'] = peaks233_80['c-carb'] + peaks233_80['m-oxid'] + peaks233_80['n-deam'] + \
peaks233_80['q-deam'] + peaks233_80['k-hydr'] + peaks233_80['r-meth']

# calculate NAAF numerator for each peptide k
peaks233_80['NAAF num.'] = peaks233_80['Area'] / peaks233_80['stripped length']

# write modified dataframe to new csv file
peaks233_80.to_csv("/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/ETNP-SKQ17/PeaksDN-PTMopt/233/ETNP-SKQ17-233-265m-0.3-JA4_PTMopt_15ppm_DN80.csv")

# check out the results
peaks233_80.head()

Unnamed: 0,Fraction,Scan,Source File,Peptide,Tag Length,ALC (%),length,m/z,z,RT,...,c-carb,m-oxid,n-deam,q-deam,k-hydr,r-meth,stripped peptide,stripped length,ptm-total,NAAF num.
0,2,14553,20170410__ETNP-233-265m-0.3um-JA4_01.raw,EN(+.98)LAALEK,8,98,8,444.7376,2,49.92,...,0,0,1,0,0,0,ENLAALEK,8,1,1112500.0
1,2,26516,20170410__ETNP-233-265m-0.3um-JA4_01.raw,VGC(+57.02)DEGLFEELPR,13,98,13,760.857,2,81.3,...,1,0,0,0,0,0,VGCDEGLFEELPR,13,1,723846.2
2,2,22453,20170410__ETNP-233-265m-0.3um-JA4_01.raw,WSVVFK,6,98,6,383.2187,2,72.23,...,0,0,0,0,0,0,WSVVFK,6,0,53666.67
3,2,19513,20170410__ETNP-233-265m-0.3um-JA4_01.raw,FDLLVNK,7,97,7,424.748,2,64.92,...,0,0,0,0,0,0,FDLLVNK,7,0,217142.9
4,2,13937,20170410__ETNP-233-265m-0.3um-JA4_01.raw,EPLGPVVR,8,97,8,433.7584,2,47.83,...,0,0,0,0,0,0,EPLGPVVR,8,0,502500.0
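Since the >50% and >80% cells repeat the same counting steps, they could be folded into one function that is applied to either dataframe. This is a sketch under the assumption that the PTM tokens are exactly the strings counted in the cells above; `add_counts`, `RESIDUES`, and `PTM_TOKENS` are hypothetical names.

```python
import pandas as pd

# The 20 standard amino acids, in the column order used in this notebook
RESIDUES = "ACDEFGHIKLMNPQRSTVWY"

# Column name -> substring counted in the 'Peptide' column (same strings as above)
PTM_TOKENS = {
    "c-carb": "57.02",
    "m-oxid": "M(+15.99)",
    "n-deam": "N(+.98)",
    "q-deam": "Q(+.98)",
    "k-hydr": "K(+15.99)",
    "r-meth": "R(+14.02)",
}


def add_counts(df, peptide_col="Peptide"):
    """Return a copy of df with one count column per residue and per PTM."""
    out = df.copy()
    for aa in RESIDUES:
        out[aa] = out[peptide_col].str.count(aa)
    for col, token in PTM_TOKENS.items():
        # plain str.count, so '(' and '+' are treated literally, not as regex
        out[col] = out[peptide_col].apply(lambda s, t=token: s.count(t))
    out["ptm-total"] = out[list(PTM_TOKENS)].sum(axis=1)
    return out


# Two peptides from the table above
toy = pd.DataFrame({"Peptide": ["EN(+.98)LAALEK", "VGC(+57.02)DEGLFEELPR"]})
counted = add_counts(toy)
print(counted[["Peptide", "A", "E", "c-carb", "n-deam", "ptm-total"]])
```

Calling `add_counts` once per ALC cutoff would replace both long counting cells without changing the resulting columns.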


In [16]:
# made a new dataframe that contains the sums of certain columns in modified 
#peptide dataframe above (for >80% ALC)

index = ['sample total']

data = {'A': peaks233_80['A'].sum(),
        'C': peaks233_80['C'].sum(),
        'D': peaks233_80['D'].sum(),
        'E': peaks233_80['E'].sum(),
        'F': peaks233_80['F'].sum(),
        'G': peaks233_80['G'].sum(),
        'H': peaks233_80['H'].sum(),
        'I': peaks233_80['I'].sum(),
        'K': peaks233_80['K'].sum(),
        'L': peaks233_80['L'].sum(),
        'M': peaks233_80['M'].sum(),
        'N': peaks233_80['N'].sum(),
        'P': peaks233_80['P'].sum(),
        'Q': peaks233_80['Q'].sum(),
        'R': peaks233_80['R'].sum(),
        'S': peaks233_80['S'].sum(),
        'T': peaks233_80['T'].sum(),
        'V': peaks233_80['V'].sum(),
        'W': peaks233_80['W'].sum(),
        'Y': peaks233_80['Y'].sum(),
        'c-carb': peaks233_80['c-carb'].sum(),
        'm-oxid': peaks233_80['m-oxid'].sum(),
        'n-deam': peaks233_80['n-deam'].sum(),
        'q-deam': peaks233_80['q-deam'].sum(),
        'k-hydr': peaks233_80['k-hydr'].sum(),
        'r-meth': peaks233_80['r-meth'].sum(),
        'Total area': peaks233_80['Area'].sum(),
        'Total length': peaks233_80['stripped length'].sum()
       }

totalpeaks233_80 = pd.DataFrame(data, columns=['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M',\
                                               'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y', 'c-carb',\
                                               'm-oxid', 'n-deam', 'q-deam', 'k-hydr', 'r-meth',\
                                               'Total area', 'Total length'], index=index)

# calculate percentage of C's with carb (should be 1.0)
totalpeaks233_80['% C w/ carb'] = totalpeaks233_80['c-carb'] / totalpeaks233_80['C'] 

# calculate percentage of M's that are oxidized
totalpeaks233_80['% M w/ oxid'] = totalpeaks233_80['m-oxid'] / totalpeaks233_80['M'] 

# calculate percentage of N's that are deamidated
totalpeaks233_80['% N w/ deam'] = totalpeaks233_80['n-deam'] / totalpeaks233_80['N'] 

# calculate percentage of Q's that are deamidated
totalpeaks233_80['% Q w/ deam'] = totalpeaks233_80['q-deam'] / totalpeaks233_80['Q'] 

# calculate percentage of K's that are hydroxylated
totalpeaks233_80['% K w/ hydr'] = totalpeaks233_80['k-hydr'] / totalpeaks233_80['K'] 

# calculate percentage of P's that are hydroxylated (P hydroxylation was dropped from the final searches)
#totalpeaks233_80['% P w/ hydr'] = totalpeaks233_80['p-hydr'] / totalpeaks233_80['P']

# calculate percentage of R's that are methylated
totalpeaks233_80['% R w/ meth'] = totalpeaks233_80['r-meth'] / totalpeaks233_80['R'] 

# calculate NAAF denominator for all peptides in dataset i
totalpeaks233_80['NAAF denom.'] = totalpeaks233_80['Total area'] / totalpeaks233_80['Total length']

# write modified dataframe to new csv file
totalpeaks233_80.to_csv("/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/ETNP-SKQ17/PeaksDN-PTMopt/233/ETNP-SKQ17-233-265m-0.3-JA4_PTMopt_15ppm_DN80_totals.csv")

totalpeaks233_80.head()

Unnamed: 0,A,C,D,E,F,G,H,I,K,L,...,r-meth,Total area,Total length,% C w/ carb,% M w/ oxid,% N w/ deam,% Q w/ deam,% K w/ hydr,% R w/ meth,NAAF denom.
sample total,637,57,255,448,258,280,67,0,535,940,...,93,7082521000.0,6274,1.0,0.508929,0.241117,0.030303,0.076636,0.207127,1128868.0
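The displayed denominator is rounded for printing, so retyping it by hand can lose precision. A hedged alternative is to read the value out of the totals dataframe by label. The sketch below uses a toy one-row totals frame with the same layout; the numbers are invented.

```python
import pandas as pd

# Toy totals frame with the same index label and column names as above
totals = pd.DataFrame(
    {"Total area": [600.0], "Total length": [20]},
    index=["sample total"],
)
totals["NAAF denom."] = totals["Total area"] / totals["Total length"]

# Pull the full-precision scalar out by row and column label
naaf_denom = totals.loc["sample total", "NAAF denom."]
print(naaf_denom)
```

With the real data, the same `.loc` lookup on `totalpeaks233_80` would give the denominator without copying the printed number.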


In [17]:
# use the calculated NAAF denominator (in the totalpeaks233_80 dataframe, above) to calculate the NAAF
# NAAF: normalized area abundance factor

# don't have to worry here about DECOY hits messing with Area totals
# but we will with Comet results

NAAF80 = 1.128868e+06

# divide each peptide's NAAF numerator by the >80% ALC denominator to get its NAAF
peaks233_80['NAAF factor'] = (peaks233_80['NAAF num.'])/NAAF80

# make a dataframe that contains only what we need: sequences, AAs, PTMs
peaks233_AA80 = peaks233_80[['stripped peptide', 'NAAF factor', 'A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', \
                             'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y', 'c-carb', 'm-oxid', \
                             'n-deam', 'q-deam', 'k-hydr', 'r-meth']].copy()

# multiply each AA count by the peptide's NAAF to normalize its abundance by peak area and peptide length

peaks233_AA80['A-NAAF80'] = peaks233_AA80['A'] * peaks233_80['NAAF factor']
peaks233_AA80['C-NAAF80'] = peaks233_AA80['C'] * peaks233_80['NAAF factor']
peaks233_AA80['D-NAAF80'] = peaks233_AA80['D'] * peaks233_80['NAAF factor']
peaks233_AA80['E-NAAF80'] = peaks233_AA80['E'] * peaks233_80['NAAF factor']
peaks233_AA80['F-NAAF80'] = peaks233_AA80['F'] * peaks233_80['NAAF factor']
peaks233_AA80['G-NAAF80'] = peaks233_AA80['G'] * peaks233_80['NAAF factor']
peaks233_AA80['H-NAAF80'] = peaks233_AA80['H'] * peaks233_80['NAAF factor']
peaks233_AA80['I-NAAF80'] = peaks233_AA80['I'] * peaks233_80['NAAF factor']
peaks233_AA80['K-NAAF80'] = peaks233_AA80['K'] * peaks233_80['NAAF factor']
peaks233_AA80['L-NAAF80'] = peaks233_AA80['L'] * peaks233_80['NAAF factor']
peaks233_AA80['M-NAAF80'] = peaks233_AA80['M'] * peaks233_80['NAAF factor']
peaks233_AA80['N-NAAF80'] = peaks233_AA80['N'] * peaks233_80['NAAF factor']
peaks233_AA80['P-NAAF80'] = peaks233_AA80['P'] * peaks233_80['NAAF factor']
peaks233_AA80['Q-NAAF80'] = peaks233_AA80['Q'] * peaks233_80['NAAF factor']
peaks233_AA80['R-NAAF80'] = peaks233_AA80['R'] * peaks233_80['NAAF factor']
peaks233_AA80['S-NAAF80'] = peaks233_AA80['S'] * peaks233_80['NAAF factor']
peaks233_AA80['T-NAAF80'] = peaks233_AA80['T'] * peaks233_80['NAAF factor']
peaks233_AA80['V-NAAF80'] = peaks233_AA80['V'] * peaks233_80['NAAF factor']
peaks233_AA80['W-NAAF80'] = peaks233_AA80['W'] * peaks233_80['NAAF factor']
peaks233_AA80['Y-NAAF80'] = peaks233_AA80['Y'] * peaks233_80['NAAF factor']

# multiply each PTM count by the peptide's NAAF to normalize its abundance by peak area and peptide length

peaks233_AA80['ccarb-NAAF80'] = peaks233_AA80['c-carb'] * peaks233_AA80['NAAF factor']
peaks233_AA80['moxid-NAAF80'] = peaks233_AA80['m-oxid'] * peaks233_AA80['NAAF factor']
peaks233_AA80['ndeam-NAAF80'] = peaks233_AA80['n-deam'] * peaks233_AA80['NAAF factor']
peaks233_AA80['qdeam-NAAF80'] = peaks233_AA80['q-deam'] * peaks233_AA80['NAAF factor']
peaks233_AA80['khydr-NAAF80'] = peaks233_AA80['k-hydr'] * peaks233_AA80['NAAF factor']
peaks233_AA80['rmeth-NAAF80'] = peaks233_AA80['r-meth'] * peaks233_AA80['NAAF factor']

# write the dataframe to a new csv
peaks233_AA80.to_csv("/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/ETNP-SKQ17/PeaksDN-PTMopt/233/ETNP-SKQ17-233-265m-0.3-JA4_PTMopt_15ppm_DN80_NAAF.csv")

peaks233_AA80.head()

Unnamed: 0,stripped peptide,NAAF factor,A,C,D,E,F,G,H,I,...,T-NAAF80,V-NAAF80,W-NAAF80,Y-NAAF80,ccarb-NAAF80,moxid-NAAF80,ndeam-NAAF80,qdeam-NAAF80,khydr-NAAF80,rmeth-NAAF80
0,ENLAALEK,0.985501,2,0,0,2,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.985501,0.0,0.0,0.0
1,VGCDEGLFEELPR,0.641214,0,1,1,3,1,2,0,0,...,0.0,0.641214,0.0,0.0,0.641214,0.0,0.0,0.0,0.0,0.0
2,WSVVFK,0.04754,0,0,0,0,1,0,0,0,...,0.0,0.09508,0.04754,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,FDLLVNK,0.192355,0,0,1,0,1,0,0,0,...,0.0,0.192355,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,EPLGPVVR,0.445136,0,0,0,1,0,1,0,0,...,0.0,0.890272,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
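
The products in the cell above depend on pandas matching the two operands by index label rather than by row position. The sketch below uses two made-up toy frames (`filtered` and `counts` are hypothetical names, not objects from this notebook) to show what happens when a count column with a fresh 0..n-1 index is multiplied by a factor column that still carries the row labels of a filtered dataframe.

```python
import pandas as pd

# toy stand-in for a filtered dataframe: the original row labels (2, 5) survive the filter
filtered = pd.DataFrame({'NAAF factor': [0.5, 0.25]}, index=[2, 5])

# toy stand-in for a counts dataframe rebuilt with a fresh 0..n-1 index
counts = pd.DataFrame({'V': [2, 1], 'NAAF factor': [0.5, 0.25]})

# Series arithmetic aligns on index labels, not on row position:
# no label is shared here, so every product is NaN
misaligned = counts['V'] * filtered['NAAF factor']

# both operands come from the same dataframe, so the labels always match
aligned = counts['V'] * counts['NAAF factor']

print(misaligned)
print(aligned)
```

If the two dataframes in this notebook share the same index the results are identical either way; taking both columns from one dataframe simply removes the dependency.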


In [18]:
# make a new dataframe that contains the sums of NAAF-normalized AAs for the 233 PeaksDN results
# also contains the sums of the NAAF-corrected PTM occurrences for each affected residue

index = ['sample total']

data = {'NAAF': peaks233_AA80['NAAF factor'].sum(),
        'A-NAAF': peaks233_AA80['A-NAAF80'].sum(),
        'C-NAAF': peaks233_AA80['C-NAAF80'].sum(),
        'D-NAAF': peaks233_AA80['D-NAAF80'].sum(),
        'E-NAAF': peaks233_AA80['E-NAAF80'].sum(),
        'F-NAAF': peaks233_AA80['F-NAAF80'].sum(),
        'G-NAAF': peaks233_AA80['G-NAAF80'].sum(),
        'H-NAAF': peaks233_AA80['H-NAAF80'].sum(),
        'I-NAAF': peaks233_AA80['I-NAAF80'].sum(),
        'K-NAAF': peaks233_AA80['K-NAAF80'].sum(),
        'L-NAAF': peaks233_AA80['L-NAAF80'].sum(),
        'M-NAAF': peaks233_AA80['M-NAAF80'].sum(),
        'N-NAAF': peaks233_AA80['N-NAAF80'].sum(),
        'P-NAAF': peaks233_AA80['P-NAAF80'].sum(),
        'Q-NAAF': peaks233_AA80['Q-NAAF80'].sum(),
        'R-NAAF': peaks233_AA80['R-NAAF80'].sum(),
        'S-NAAF': peaks233_AA80['S-NAAF80'].sum(),
        'T-NAAF': peaks233_AA80['T-NAAF80'].sum(),
        'V-NAAF': peaks233_AA80['V-NAAF80'].sum(),
        'W-NAAF': peaks233_AA80['W-NAAF80'].sum(),
        'Y-NAAF': peaks233_AA80['Y-NAAF80'].sum(),
        'C-carb-NAAF': peaks233_AA80['ccarb-NAAF80'].sum(),
        'M-oxid-NAAF': peaks233_AA80['moxid-NAAF80'].sum(),
        'N-deam-NAAF': peaks233_AA80['ndeam-NAAF80'].sum(),
        'Q-deam-NAAF': peaks233_AA80['qdeam-NAAF80'].sum(),
        'K-hydr-NAAF': peaks233_AA80['khydr-NAAF80'].sum(),
        'R-meth-NAAF': peaks233_AA80['rmeth-NAAF80'].sum()
       }

totalpeaks233_AA80 = pd.DataFrame(data, columns=['NAAF', 'A-NAAF', 'C-NAAF', 'D-NAAF', 'E-NAAF', 'F-NAAF',
                                                   'G-NAAF', 'H-NAAF', 'I-NAAF', 'K-NAAF', 'L-NAAF', 'M-NAAF',
                                                   'N-NAAF', 'P-NAAF', 'Q-NAAF', 'R-NAAF', 'S-NAAF',
                                                   'T-NAAF', 'V-NAAF', 'W-NAAF', 'Y-NAAF', 'C-carb-NAAF',
                                                   'M-oxid-NAAF', 'N-deam-NAAF', 'Q-deam-NAAF', 'K-hydr-NAAF',
                                                   'R-meth-NAAF'], index=index)

# calculate the NAAF-corrected fraction of modified C, M, N, Q, K, and R residues
# (values are fractions from 0 to 1, despite the '%' in the column names)


totalpeaks233_AA80['% C w/ carb. NAAF'] = totalpeaks233_AA80['C-carb-NAAF'] / totalpeaks233_AA80['C-NAAF']
totalpeaks233_AA80['% M w/ oxid. NAAF'] = totalpeaks233_AA80['M-oxid-NAAF'] / totalpeaks233_AA80['M-NAAF']
totalpeaks233_AA80['% N w/ deam. NAAF'] = totalpeaks233_AA80['N-deam-NAAF'] / totalpeaks233_AA80['N-NAAF']
totalpeaks233_AA80['% Q w/ deam. NAAF'] = totalpeaks233_AA80['Q-deam-NAAF'] / totalpeaks233_AA80['Q-NAAF']
totalpeaks233_AA80['% K w/ hydr. NAAF'] = totalpeaks233_AA80['K-hydr-NAAF'] / totalpeaks233_AA80['K-NAAF']
totalpeaks233_AA80['% R w/ meth. NAAF'] = totalpeaks233_AA80['R-meth-NAAF'] / totalpeaks233_AA80['R-NAAF']

# check: divide the summed NAAF by the NAAF denominator calculated in an earlier cell
# (1.128868e+06 is hard-coded; update it for each new sample)
totalpeaks233_AA80['NAAF check'] = totalpeaks233_AA80['NAAF'] / 1.128868e+06

# write modified dataframe to new csv file, same name + totals
totalpeaks233_AA80.to_csv("/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/ETNP-SKQ17/PeaksDN-PTMopt/233/ETNP-SKQ17-233-265m-0.3-JA4_PTMopt_15ppm_DN80_NAAF_totals.csv")

totalpeaks233_AA80.head()

Unnamed: 0,NAAF,A-NAAF,C-NAAF,D-NAAF,E-NAAF,F-NAAF,G-NAAF,H-NAAF,I-NAAF,K-NAAF,...,Q-deam-NAAF,K-hydr-NAAF,R-meth-NAAF,% C w/ carb. NAAF,% M w/ oxid. NAAF,% N w/ deam. NAAF,% Q w/ deam. NAAF,% K w/ hydr. NAAF,% R w/ meth. NAAF,NAAF check
sample total,729.042031,659.799741,64.905681,65.848432,256.832922,63.603617,148.393768,8.960728,0.0,169.382693,...,25.855878,35.800928,70.427309,1.0,0.1,0.585299,0.224428,0.211361,0.10607,0.000646
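
One residue total in the output above is 0.0 (I-NAAF), so a ratio with a zero denominator is possible for other samples. The following is a minimal sketch, on a made-up one-row table, of computing the modified fractions in a loop while turning zero totals into NaN. The `totals` frame, the `ptm_pairs` list, and all of the numbers are illustrative assumptions; only the column naming follows this notebook.

```python
import numpy as np
import pandas as pd

# hypothetical one-row totals table using the same column naming as the notebook
totals = pd.DataFrame(
    {'C-NAAF': [4.0], 'C-carb-NAAF': [4.0],
     'M-NAAF': [0.0], 'M-oxid-NAAF': [0.0],
     'N-NAAF': [8.0], 'N-deam-NAAF': [2.0]},
    index=['sample total'])

# (residue, PTM column prefix, short label) for each modification of interest
ptm_pairs = [('C', 'C-carb', 'carb.'), ('M', 'M-oxid', 'oxid.'), ('N', 'N-deam', 'deam.')]

for residue, ptm, label in ptm_pairs:
    # a zero residue total becomes NaN so the ratio is reported as missing
    denom = totals[residue + '-NAAF'].replace(0, np.nan)
    totals['% ' + residue + ' w/ ' + label + ' NAAF'] = totals[ptm + '-NAAF'] / denom

print(totals.T)
```

The resulting values are fractions between 0 and 1; multiply by 100 if true percentages are wanted.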


### Visualizing the results

In [19]:
print("ALC max: ", peaks233_80['ALC (%)'].max())
print("ALC min: ", peaks233_80['ALC (%)'].min())

ALC max:  98
ALC min:  80


### Exporting txt files of stripped peptides at confidence cutoffs:

In [20]:
# keep only peptide column >50% ALC
pep233_50_moddup = peaks233_50[["Peptide"]]

# keep only the stripped peptide column for the >50% ALC
# this is what we'll use for UniPept input, etc
pep233_50_dup = peaks233_50[["stripped peptide"]]

# deduplicate both of these lists
pep233_50_mod = pep233_50_moddup.drop_duplicates()
pep233_50 = pep233_50_dup.drop_duplicates()

# print out the #s of modified and stripped peptides, deduplicated and not

print("Total modified peptides in 233:", len(pep233_50_moddup))
print("Deduplicated modified peptides in 233:", len(pep233_50_mod))
print("Total stripped peptides in 233:", len(pep233_50_dup))
print("Deduplicated stripped peptides in 233:", len(pep233_50))

# write the deduplicated stripped peptides to a new txt file
# header=False and index=False drop the column header and the row index

pep233_50.to_csv("/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/ETNP-SKQ17/PeaksDN-PTMopt/233/ETNP-SKQ17-233-265m-0.3-JA4_PTMopt_15ppm_DN50_stripped_peptides.txt", header=False, index=False)

# make the text file into a FASTA, using each line number as the sequence header

!awk '{print ">"NR"\n"$0}' /home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/ETNP-SKQ17/PeaksDN-PTMopt/233/ETNP-SKQ17-233-265m-0.3-JA4_PTMopt_15ppm_DN50_stripped_peptides.txt > \
/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/ETNP-SKQ17/PeaksDN-PTMopt/233/ETNP-SKQ17-233-265m-0.3-JA4_PTMopt_15ppm_DN50_stripped_peptides.fas


Total modified peptides in 233: 4235
Deduplicated modified peptides in 233: 4144
Total stripped peptides in 233: 4235
Deduplicated stripped peptides in 233: 4048
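
The awk one-liner in the cell above numbers each peptide and writes it as a FASTA record. Where awk is not available, the same conversion could be sketched in plain Python. `write_fasta` is a hypothetical helper, not part of this notebook, and the output path is a throwaway temporary file.

```python
import os
import tempfile

def write_fasta(peptides, path):
    """Write one FASTA record per peptide, numbering headers from 1 like awk's NR."""
    with open(path, 'w') as handle:
        for number, peptide in enumerate(peptides, start=1):
            handle.write('>{}\n{}\n'.format(number, peptide))

# three stripped peptides copied from the head() output earlier in this notebook
example_peptides = ['ENLAALEK', 'WSVVFK', 'FDLLVNK']
fasta_path = os.path.join(tempfile.mkdtemp(), 'example_stripped_peptides.fas')
write_fasta(example_peptides, fasta_path)

with open(fasta_path) as handle:
    fasta_text = handle.read()
print(fasta_text)
```

With the dataframe from the cell above, a call such as `write_fasta(pep233_50['stripped peptide'], path)` should behave the same way, since iterating a pandas Series yields its values.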


In [21]:
# keep only peptide column >=80% ALC
pep233_80_moddup = peaks233_80[["Peptide"]]

# keep only the stripped peptide column for the >=80% ALC
# this is what we'll use for UniPept input, etc
pep233_80_dup = peaks233_80[["stripped peptide"]]

# deduplicate both of these lists
pep233_80_mod = pep233_80_moddup.drop_duplicates()
pep233_80 = pep233_80_dup.drop_duplicates()

# print out the #s of modified and stripped peptides, deduplicated and not

print("Total modified peptides in 233:", len(pep233_80_moddup))
print("Deduplicated modified peptides in 233:", len(pep233_80_mod))
print("Total stripped peptides in 233:", len(pep233_80_dup))
print("Deduplicated stripped peptides in 233:", len(pep233_80))

# write the deduplicated stripped peptides to a new txt file
# header=False and index=False drop the column header and the row index

pep233_80.to_csv("/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/ETNP-SKQ17/PeaksDN-PTMopt/233/ETNP-SKQ17-233-265m-0.3-JA4_PTMopt_15ppm_DN80_stripped_peptides.txt", header=False, index=False)

# make the text file into a FASTA, using each line number as the sequence header

!awk '{print ">"NR"\n"$0}' /home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/ETNP-SKQ17/PeaksDN-PTMopt/233/ETNP-SKQ17-233-265m-0.3-JA4_PTMopt_15ppm_DN80_stripped_peptides.txt > \
/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/ETNP-SKQ17/PeaksDN-PTMopt/233/ETNP-SKQ17-233-265m-0.3-JA4_PTMopt_15ppm_DN80_stripped_peptides.fas


Total modified peptides in 233: 730
Deduplicated modified peptides in 233: 686
Total stripped peptides in 233: 730
Deduplicated stripped peptides in 233: 684
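
The two export cells above differ only in the ALC cutoff, so the counting step could be wrapped in one function and called once per cutoff. This is a sketch under assumptions: `peptide_counts` is a hypothetical helper, the `toy` rows (including the modification notation) are made up for illustration, and only the three column names come from this notebook.

```python
import pandas as pd

def peptide_counts(df, cutoff):
    """Count redundant and unique modified/stripped peptides at or above an ALC cutoff."""
    kept = df.loc[df['ALC (%)'] >= cutoff]
    return {
        'total': len(kept),
        'unique modified': kept['Peptide'].nunique(),
        'unique stripped': kept['stripped peptide'].nunique(),
    }

# made-up rows: two forms of one peptide plus a repeated second peptide
toy = pd.DataFrame({
    'Peptide': ['ENLAALEK', 'EN(+.98)LAALEK', 'WSVVFK', 'WSVVFK'],
    'stripped peptide': ['ENLAALEK', 'ENLAALEK', 'WSVVFK', 'WSVVFK'],
    'ALC (%)': [95, 82, 60, 55],
})

summary = {cutoff: peptide_counts(toy, cutoff) for cutoff in (50, 80)}
print(summary)
```

Note that `nunique()` ignores missing values while `drop_duplicates()` keeps one, so the two approaches can differ by one if a peptide column contains NaN.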
