### Manipulation of PEAKS de novo results of _T. weiss_  LC-MS/MS data using Python.

Starting with:

    PEAKS de novo results (.csv) of PTM-optimized sequencing >80% ALC
    from Thermo Fusion tribrid runs at the UW Proteomics Resource center, January 2021
    combined from multiple injections

Goal:

    Files with stripped (no PTMs) peptide lists and
    Columns with #'s of each modification in every sequence
    Column with stripped peptide lengths (# amino acids)
    Files with peptides and PTMs for PTM+cellular compartment x-analysis
    
### To use:

#### 1. Change the input file name in *IN 4*
#### 2. Change output file name in *IN 6*, *IN 7*, *IN 8*

In [1]:
# LIBRARIES
#import pandas library for working with tabular data
import os
os.getcwd()
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import kde
#import regular expresson (regex)
import re
#check pandas version
pd.__version__

'1.0.5'

In [2]:
cd /home/millieginty/Documents/git-repos/rot-mayer/data/MED_Weissrot_Fusion_UWPR2021/MED_Weissrot_Fusion_330-T2nd-all_DENOVO_92/

/home/millieginty/Documents/git-repos/rot-mayer/data/MED_Weissrot_Fusion_UWPR2021/MED_Weissrot_Fusion_330-T2nd-all_DENOVO_92


In [3]:
# read the CSVs into a dataframe using the read_csv function and call 'peaks'

peaks330 = pd.read_csv("/home/millieginty/Documents/git-repos/rot-mayer/data/MED_Weissrot_Fusion_UWPR2021/MED_Weissrot_Fusion_330-T2nd-all_DENOVO_92/330-T2nd-all_DENOVO_92_DN.csv")

print("# redundant Peaks peptides >80% ALC in combined dataframe:", len(peaks330))

mean_len = peaks330['length'].mean()
print(mean_len)

# look at the dataframe
peaks330.head()

# redundant Peaks peptides >80% ALC in combined dataframe: 3647
8.642171647929805


Unnamed: 0,Fraction,Scan,Source File,Peptide,Tag Length,ALC (%),length,m/z,z,RT,Area,Mass,ppm,PTM,local confidence (%),tag (>=0%),mode
0,18,17882,20210114_Weissrot_330_T2_nodigest_DDA_120min_2...,EEVEKELEDR,10,99,10,638.3072,2,58.0,3110000.0,1274.5989,0.7,,100 100 100 100 99 100 99 100 99 96,EEVEKELEDR,HCD
1,18,19538,20210114_Weissrot_330_T2_nodigest_DDA_120min_2...,KQELEDLTK,9,98,9,552.3014,2,62.5,4610000.0,1102.5869,1.2,,95 98 100 97 99 99 100 100 98,KQELEDLTK,HCD
2,18,5272,20210114_Weissrot_330_T2_nodigest_DDA_120min_2...,DEETKLSK,8,98,8,475.2462,2,20.33,28700000.0,948.4764,1.5,,92 99 100 99 99 100 100 99,DEETKLSK,HCD
3,18,18937,20210114_Weissrot_330_T2_nodigest_DDA_120min_2...,TGVFLKT,7,98,7,383.2291,2,60.84,3470000.0,764.4432,0.6,,98 98 99 99 99 98 97,TGVFLKT,HCD
4,18,5718,20210114_Weissrot_330_T2_nodigest_DDA_120min_2...,DEETKLSK,8,98,8,475.2461,2,21.35,8530000.0,948.4764,1.3,,92 99 100 99 99 100 100 99,DEETKLSK,HCD


The peptide column has the masses of modifications (e.g., 57.02 Da for carbamidomethylation of cysteine). We want to make new columns with all that information and make a column with only the 'stripped' peptide sequence that's just amino acids - this we can then align against other sequences, for example.

Modified residues were allowed for:

    fixed carbamodimethylation of cysteine 57.021464 C
    varialbe oxidation of methionine, lysine, proline, arginine, tyrosine: 15.9949 MKPRY
    variable deamidation of asparagine, glumatine: 0.984016 NQ
    variable methylation of lysine and arginine: 14.015650 KR
    variable pyro-glu from glutamine: -17.03 Q
    variable acetylation of lysine: 42.01 K


We'll then write this manipulated dataframe to a new file.

In [4]:
# use a count function to enumerate the # of A's (alanines) in each peptide
peaks330['A'] = peaks330['Peptide'].str.count("A")

# use a count function to enumerate the # of C's (cysteines) in each peptide
peaks330['C'] = peaks330['Peptide'].str.count("C")

# use a count function to enumerate the # of D's (aspartic acids) in each peptide
peaks330['D'] = peaks330['Peptide'].str.count("D")

# use a count function to enumerate the # of E's (glutamic acids) in each peptide
peaks330['E'] = peaks330['Peptide'].str.count("E")

# use a count function to enumerate the # of F's (phenylalanines) in each peptide
peaks330['F'] = peaks330['Peptide'].str.count("F")

# use a count function to enumerate the # of G's (glycines) in each peptide
peaks330['G'] = peaks330['Peptide'].str.count("G")

# use a count function to enumerate the # of H's (histidines) in each peptide
peaks330['H'] = peaks330['Peptide'].str.count("H")

# use a count function to enumerate the # of I's (isoleucines) in each peptide
# in peaks330 output, there will be no isoleucines (they're lumped in with leucines)
peaks330['I'] = peaks330['Peptide'].str.count("I")

# use a count function to enumerate the # of K's (lysines) in each peptide
peaks330['K'] = peaks330['Peptide'].str.count("K")

# use a count function to enumerate the # of L's (leucines) in each peptide
# also these include the isoleucines
peaks330['L'] = peaks330['Peptide'].str.count("L")

# use a count function to enumerate the # of M's (methionines) in each peptide
peaks330['M'] = peaks330['Peptide'].str.count("M")

# use a count function to enumerate the # of N's (asparagines) in each peptide
peaks330['N'] = peaks330['Peptide'].str.count("N")

# use a count function to enumerate the # of P's ([prolines]) in each peptide
peaks330['P'] = peaks330['Peptide'].str.count("P")

# use a count function to enumerate the # of Q's (glutamines) in each peptide
peaks330['Q'] = peaks330['Peptide'].str.count("Q")

# use a count function to enumerate the # of R's (arginines) in each peptide
peaks330['R'] = peaks330['Peptide'].str.count("R")

# use a count function to enumerate the # of S's (serines) in each peptide
peaks330['S'] = peaks330['Peptide'].str.count("S")

# use a count function to enumerate the # of T's (threonines) in each peptide
peaks330['T'] = peaks330['Peptide'].str.count("T")

# use a count function to enumerate the # of V's (valines) in each peptide
peaks330['V'] = peaks330['Peptide'].str.count("V")

# use a count function to enumerate the # of W's (tryptophans) in each peptide
peaks330['W'] = peaks330['Peptide'].str.count("W")

# use a count function to enumerate the # of Y's (tyrosines) in each peptide
peaks330['Y'] = peaks330['Peptide'].str.count("Y")

# use a count function to enumerate the # of carbamidomethylated C's in each peptide
peaks330['c-carb'] = peaks330['Peptide'].str.count("57.02")

# use a count function to enumerate the # of oxidized M's in each peptide
peaks330['m-oxid'] = peaks330['Peptide'].apply(lambda x: x.count('M(+15.99)'))

# use a count function to enumerate the # of oxidized K's in each peptide
peaks330['k-oxid'] = peaks330['Peptide'].apply(lambda x: x.count('K(+15.99)'))

# use a count function to enumerate the # of oxidized P's in each peptide
peaks330['p-oxid'] = peaks330['Peptide'].apply(lambda x: x.count('P(+15.99)'))

# use a count function to enumerate the # of oxidized R's in each peptide
peaks330['r-oxid'] = peaks330['Peptide'].apply(lambda x: x.count('R(+15.99)'))

# use a count function to enumerate the # of oxidized Y's in each peptide
peaks330['y-oxid'] = peaks330['Peptide'].apply(lambda x: x.count('Y(+15.99)'))

# use a lamba function to enumerate the # of deamidated N's in each peptide
peaks330['n-deam'] = peaks330['Peptide'].apply(lambda x: x.count('N(+.98)'))

# use a count function to enumerate the # of deamidated Q's in each peptide
#peaks330['q-deam'] = peaks330['Peptide'].apply(lambda x: x.count('Q(+.98)'))

# use a count function to enumerate the # of methylated K's in each peptide
peaks330['k-meth'] = peaks330['Peptide'].apply(lambda x: x.count('K(+14.02)'))

# use a count function to enumerate the # of methylated R's in each peptide
peaks330['r-meth'] = peaks330['Peptide'].apply(lambda x: x.count('R(+14.02)'))

# use a count function to enumerate the # of pyro glu Q's in each peptide
peaks330['q-pyro'] = peaks330['Peptide'].apply(lambda x: x.count('Q(-17.03)'))

# use a count function to enumerate the # of acetylation of K's in each peptide
peaks330['k-acet'] = peaks330['Peptide'].apply(lambda x: x.count('K(+42.01)'))

# create a column with 'stripped' peptide sequences using strip
peaks330['stripped peptide'] = peaks330['Peptide'].str.replace(r"\(.*\)","")

# add a column with the stripped peptide length (number of AAs)
peaks330['stripped length'] = peaks330['stripped peptide'].apply(len)

# total the number of modifications in sequence
peaks330['ptm-total'] = peaks330['c-carb'] + peaks330['m-oxid'] + peaks330['k-oxid'] + peaks330['p-oxid'] \
+ peaks330['r-oxid'] + peaks330['y-oxid'] + peaks330['n-deam'] + peaks330['k-meth'] + peaks330['r-meth'] \
+ peaks330['q-pyro'] + peaks330['k-acet']

# calculate NAAF numerator for each peptide k
peaks330['NAAF num.'] = peaks330['Area'] / peaks330['stripped length']

# write modified dataframe to new txt file, same name + 'stripped'
peaks330.to_csv("/home/millieginty/Documents/git-repos/rot-mayer/data/processed/PeaksDN/TW_330_T2_undigested_combine_PTMopt_DN80.csv")

# check out the results
peaks330.head()

Unnamed: 0,Fraction,Scan,Source File,Peptide,Tag Length,ALC (%),length,m/z,z,RT,...,y-oxid,n-deam,k-meth,r-meth,q-pyro,k-acet,stripped peptide,stripped length,ptm-total,NAAF num.
0,18,17882,20210114_Weissrot_330_T2_nodigest_DDA_120min_2...,EEVEKELEDR,10,99,10,638.3072,2,58.0,...,0,0,0,0,0,0,EEVEKELEDR,10,0,311000.0
1,18,19538,20210114_Weissrot_330_T2_nodigest_DDA_120min_2...,KQELEDLTK,9,98,9,552.3014,2,62.5,...,0,0,0,0,0,0,KQELEDLTK,9,0,512222.2
2,18,5272,20210114_Weissrot_330_T2_nodigest_DDA_120min_2...,DEETKLSK,8,98,8,475.2462,2,20.33,...,0,0,0,0,0,0,DEETKLSK,8,0,3587500.0
3,18,18937,20210114_Weissrot_330_T2_nodigest_DDA_120min_2...,TGVFLKT,7,98,7,383.2291,2,60.84,...,0,0,0,0,0,0,TGVFLKT,7,0,495714.3
4,18,5718,20210114_Weissrot_330_T2_nodigest_DDA_120min_2...,DEETKLSK,8,98,8,475.2461,2,21.35,...,0,0,0,0,0,0,DEETKLSK,8,0,1066250.0


### Export txt file of entire (with modification terms) peptides only

In [5]:
# keep only peptide list with mods
dn_pep_330 = peaks330[['Peptide']]

# deduplice the lists
dn_modpep_330 = dn_pep_330.drop_duplicates()

# write altered dataframe to new txt file
# used header and index parameters to get rid of 'Peptide' header and the indexing
dn_modpep_330.to_csv("/home/millieginty/Documents/git-repos/rot-mayer/data/processed/PTM-cellular-compartment/to-combine/T2/TW_330_T2_trypsin_noenz_combine_PTMopt_DN_mod_peptides.txt", header=False, index=False)

# look at the stripped peptides
dn_modpep_330.head()

Unnamed: 0,Peptide
0,EEVEKELEDR
1,KQELEDLTK
2,DEETKLSK
3,TGVFLKT
5,EEVEQELEK(+42.01)T


In [7]:
# made a new dataframe that contains the sums of certain columns 
# in the stripped peptide dataframe above (for >80% ALC)

index = ['sample total']

data = {'A': peaks330['A'].sum(),
        'C': peaks330['C'].sum(),
        'D': peaks330['D'].sum(),
        'E': peaks330['E'].sum(),
        'F': peaks330['F'].sum(),
        'G': peaks330['G'].sum(),
        'H': peaks330['H'].sum(),
        'I': peaks330['I'].sum(),
        'K': peaks330['K'].sum(),
        'L': peaks330['L'].sum(),
        'M': peaks330['M'].sum(),
        'N': peaks330['N'].sum(),
        'P': peaks330['P'].sum(),
        'Q': peaks330['Q'].sum(),
        'R': peaks330['R'].sum(),
        'S': peaks330['S'].sum(),
        'T': peaks330['T'].sum(),
        'V': peaks330['V'].sum(),
        'W': peaks330['W'].sum(),
        'Y': peaks330['Y'].sum(),
        'c-carb': peaks330['c-carb'].sum(),
        'm-oxid': peaks330['m-oxid'].sum(),
        'k-oxid': peaks330['k-oxid'].sum(),
        'p-oxid': peaks330['p-oxid'].sum(),
        'r-oxid': peaks330['r-oxid'].sum(),
        'y-oxid': peaks330['y-oxid'].sum(),
        'n-deam': peaks330['n-deam'].sum(),
        'k-meth': peaks330['k-meth'].sum(),
        'r-meth': peaks330['r-meth'].sum(),
        'q-pyro': peaks330['q-pyro'].sum(),
        'k-acet': peaks330['k-acet'].sum(),
        'Total area': peaks330['Area'].sum(),
        'Total length': peaks330['stripped length'].sum()
       }

totalpeaks330 = pd.DataFrame(data, columns=['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', \
                                            'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y', \
                                            'c-carb', 'm-oxid', 'k-oxid', 'p-oxid', 'r-oxid', \
                                            'y-oxid', 'n-deam', 'k-meth', 'r-meth', 'q-pyro', \
                                            'k-acet', 'Total area', 'Total length'], index=index)

# calculate percentage of C's with carb (should be 1.0)
totalpeaks330['% C w/ carb'] = totalpeaks330['c-carb'] / totalpeaks330['C'] 

# calculate percentage of M's that are oxidized
totalpeaks330['% M w/ oxid'] = totalpeaks330['m-oxid'] / totalpeaks330['M'] 

# calculate percentage of K's that are oxidized
totalpeaks330['% K w/ oxid'] = totalpeaks330['k-oxid'] / totalpeaks330['K'] 

# calculate percentage of P's that are oxidized
totalpeaks330['% P w/ oxid'] = totalpeaks330['p-oxid'] / totalpeaks330['P'] 

# calculate percentage of R's that are oxidized
totalpeaks330['% R w/ oxid'] = totalpeaks330['p-oxid'] / totalpeaks330['R'] 

# calculate percentage of Y's that are oxidized
totalpeaks330['% Y w/ oxid'] = totalpeaks330['y-oxid'] / totalpeaks330['Y'] 

# calculate percentage of N's that are deamidated
totalpeaks330['% N w/ deam'] = totalpeaks330['n-deam'] / totalpeaks330['N'] 

# calculate percentage of K's that are methylated
totalpeaks330['% K w/ meth'] = totalpeaks330['k-meth'] / totalpeaks330['K'] 

# calculate percentage of R's that are methylated
totalpeaks330['% R w/ meth'] = totalpeaks330['r-meth'] / totalpeaks330['R'] 

# calculate percentage of Q's that are pyro glu'd
totalpeaks330['% Q w/ pyro'] = totalpeaks330['q-pyro'] / totalpeaks330['Q'] 

# calculate percentage of K's that are acetylation
totalpeaks330['% K w/ acet'] = totalpeaks330['k-acet'] / totalpeaks330['K'] 

# calculate NAAF denominator for all peptides in dataset i
totalpeaks330['NAAF denom.'] = totalpeaks330['Total area'] / totalpeaks330['Total length']

# write modified dataframe to new txt file
totalpeaks330.to_csv("/home/millieginty/Documents/git-repos/rot-mayer/data/processed/PeaksDN/TW_330_T2_undigested_combine_PTMopt_DN80_totals.csv")

totalpeaks330.head()

Unnamed: 0,A,C,D,E,F,G,H,I,K,L,...,% K w/ oxid,% P w/ oxid,% R w/ oxid,% Y w/ oxid,% N w/ deam,% K w/ meth,% R w/ meth,% Q w/ pyro,% K w/ acet,NAAF denom.
sample total,2285,98,1570,3308,1133,1855,429,0,2203,3779,...,0.121198,0.216791,0.842458,0.204159,0.36125,0.160236,0.158659,0.060268,0.109396,592839.766634


In [8]:
# use the calculated NAAF factor (in totalpeaks dataframe, above) to caluclate the NAAF 
# NAAF: normalized normalized area abundance factor

# don't have to worry here about DECOY hits messing with Area totals
# but we would with Comet results

NAAF80 = 592839.766634

# use NAAF >80% ALC to get NAAF factor
peaks330['NAAF factor'] = (peaks330['NAAF num.'])/NAAF80

# make a dataframe that contains only what we need: sequences, AAs, PTMs
peaksAAPTM_330 = peaks330[['stripped peptide', 'NAAF factor', 'A', 'C', 'D', 'E', 'F', 'G', 'H', 'K', 'I', 'L', \
                                'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y', 'c-carb', 'm-oxid', \
                                'k-oxid', 'p-oxid', 'r-oxid', 'y-oxid', 'n-deam', 'k-meth', 'r-meth', \
                                'q-pyro', 'k-acet']].copy()

# multiply the NAAF80 factor by the AA total to normalize its abundance by peak area and peptide length

peaksAAPTM_330['A-NAAF80'] = peaksAAPTM_330['A'] * peaks330['NAAF factor']
peaksAAPTM_330['C-NAAF80'] = peaksAAPTM_330['C'] * peaks330['NAAF factor']
peaksAAPTM_330['D-NAAF80'] = peaksAAPTM_330['D'] * peaks330['NAAF factor']
peaksAAPTM_330['E-NAAF80'] = peaksAAPTM_330['E'] * peaks330['NAAF factor']
peaksAAPTM_330['F-NAAF80'] = peaksAAPTM_330['F'] * peaks330['NAAF factor']
peaksAAPTM_330['G-NAAF80'] = peaksAAPTM_330['G'] * peaks330['NAAF factor']
peaksAAPTM_330['H-NAAF80'] = peaksAAPTM_330['H'] * peaks330['NAAF factor']
peaksAAPTM_330['I-NAAF80'] = peaksAAPTM_330['I'] * peaks330['NAAF factor']
peaksAAPTM_330['K-NAAF80'] = peaksAAPTM_330['K'] * peaks330['NAAF factor']
peaksAAPTM_330['L-NAAF80'] = peaksAAPTM_330['L'] * peaks330['NAAF factor']
peaksAAPTM_330['M-NAAF80'] = peaksAAPTM_330['M'] * peaks330['NAAF factor']
peaksAAPTM_330['N-NAAF80'] = peaksAAPTM_330['N'] * peaks330['NAAF factor']
peaksAAPTM_330['P-NAAF80'] = peaksAAPTM_330['P'] * peaks330['NAAF factor']
peaksAAPTM_330['Q-NAAF80'] = peaksAAPTM_330['Q'] * peaks330['NAAF factor']
peaksAAPTM_330['R-NAAF80'] = peaksAAPTM_330['R'] * peaks330['NAAF factor']
peaksAAPTM_330['S-NAAF80'] = peaksAAPTM_330['S'] * peaks330['NAAF factor']
peaksAAPTM_330['T-NAAF80'] = peaksAAPTM_330['T'] * peaks330['NAAF factor']
peaksAAPTM_330['V-NAAF80'] = peaksAAPTM_330['V'] * peaks330['NAAF factor']
peaksAAPTM_330['W-NAAF80'] = peaksAAPTM_330['W'] * peaks330['NAAF factor']
peaksAAPTM_330['Y-NAAF80'] = peaksAAPTM_330['Y'] * peaks330['NAAF factor']

# multiply the NAAF80 factor by the PTMs normalize its abundance by peak area and peptide length

peaksAAPTM_330['ccarb-NAAF80'] = peaksAAPTM_330['c-carb'] * peaksAAPTM_330['NAAF factor']
peaksAAPTM_330['moxid-NAAF80'] = peaksAAPTM_330['m-oxid'] * peaksAAPTM_330['NAAF factor']
peaksAAPTM_330['koxid-NAAF80'] = peaksAAPTM_330['k-oxid'] * peaksAAPTM_330['NAAF factor']
peaksAAPTM_330['poxid-NAAF80'] = peaksAAPTM_330['p-oxid'] * peaksAAPTM_330['NAAF factor']
peaksAAPTM_330['roxid-NAAF80'] = peaksAAPTM_330['r-oxid'] * peaksAAPTM_330['NAAF factor']
peaksAAPTM_330['yoxid-NAAF80'] = peaksAAPTM_330['y-oxid'] * peaksAAPTM_330['NAAF factor']
peaksAAPTM_330['ndeam-NAAF80'] = peaksAAPTM_330['n-deam'] * peaksAAPTM_330['NAAF factor']
peaksAAPTM_330['kmeth-NAAF80'] = peaksAAPTM_330['k-meth'] * peaksAAPTM_330['NAAF factor']
peaksAAPTM_330['rmeth-NAAF80'] = peaksAAPTM_330['r-meth'] * peaksAAPTM_330['NAAF factor']
peaksAAPTM_330['qpyro-NAAF80'] = peaksAAPTM_330['q-pyro'] * peaksAAPTM_330['NAAF factor']
peaksAAPTM_330['kacet-NAAF80'] = peaksAAPTM_330['k-acet'] * peaksAAPTM_330['NAAF factor']

# write the dataframe to a new csv
peaksAAPTM_330.to_csv("/home/millieginty/Documents/git-repos/rot-mayer/data/processed/PeaksDN/TW_330_T2_undigested_combine_PTMopt_DN80_NAAF.csv")

peaksAAPTM_330.head()

Unnamed: 0,stripped peptide,NAAF factor,A,C,D,E,F,G,H,K,...,moxid-NAAF80,koxid-NAAF80,poxid-NAAF80,roxid-NAAF80,yoxid-NAAF80,ndeam-NAAF80,kmeth-NAAF80,rmeth-NAAF80,qpyro-NAAF80,kacet-NAAF80
0,EEVEKELEDR,0.524594,0,0,1,5,0,0,0,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,KQELEDLTK,0.864015,0,0,1,2,0,0,0,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,DEETKLSK,6.051382,0,0,1,2,0,0,0,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,TGVFLKT,0.836169,0,0,0,0,1,1,0,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,DEETKLSK,1.798547,0,0,1,2,0,0,0,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [9]:
# made a dataframe that's the sum of NAAF corrected AAs and PTMs

index = ['sample total']

data = {'NAAF': peaksAAPTM_330['NAAF factor'].sum(),
        'A': peaksAAPTM_330['A-NAAF80'].sum(),
        'C': peaksAAPTM_330['C-NAAF80'].sum(),
        'D': peaksAAPTM_330['D-NAAF80'].sum(),
        'E': peaksAAPTM_330['E-NAAF80'].sum(),
        'F': peaksAAPTM_330['F-NAAF80'].sum(),
        'G': peaksAAPTM_330['G-NAAF80'].sum(),
        'H': peaksAAPTM_330['H-NAAF80'].sum(),
        'I': peaksAAPTM_330['I-NAAF80'].sum(),
        'K': peaksAAPTM_330['K-NAAF80'].sum(),
        'L': peaksAAPTM_330['L-NAAF80'].sum(),
        'M': peaksAAPTM_330['M-NAAF80'].sum(),
        'N': peaksAAPTM_330['N-NAAF80'].sum(),
        'P': peaksAAPTM_330['P-NAAF80'].sum(),
        'Q': peaksAAPTM_330['Q-NAAF80'].sum(),
        'R': peaksAAPTM_330['R-NAAF80'].sum(),
        'S': peaksAAPTM_330['S-NAAF80'].sum(),
        'T': peaksAAPTM_330['T-NAAF80'].sum(),
        'V': peaksAAPTM_330['V-NAAF80'].sum(),
        'W': peaksAAPTM_330['W-NAAF80'].sum(),
        'Y': peaksAAPTM_330['Y-NAAF80'].sum(),
        'c-carb': peaksAAPTM_330['ccarb-NAAF80'].sum(),
        'm-oxid': peaksAAPTM_330['moxid-NAAF80'].sum(),
        'k-oxid': peaksAAPTM_330['koxid-NAAF80'].sum(),
        'p-oxid': peaksAAPTM_330['poxid-NAAF80'].sum(),
        'r-oxid': peaksAAPTM_330['roxid-NAAF80'].sum(),
        'y-oxid': peaksAAPTM_330['yoxid-NAAF80'].sum(),
        'n-deam': peaksAAPTM_330['ndeam-NAAF80'].sum(),
        'k-meth': peaksAAPTM_330['kmeth-NAAF80'].sum(),
        'r-meth': peaksAAPTM_330['rmeth-NAAF80'].sum(),
        'q-pyro': peaksAAPTM_330['qpyro-NAAF80'].sum(),
        'k-acet': peaksAAPTM_330['kacet-NAAF80'].sum()
       }

totalpeaks80_NAAF = pd.DataFrame(data, columns=['NAAF', 'A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', \
                                           'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', \
                                           'W', 'Y', 'c-carb', 'm-oxid', 'k-oxid', 'p-oxid', \
                                           'r-oxid', 'y-oxid', 'n-deam', 'k-meth', 'r-meth', \
                                            'q-pyro', 'k-acet'], index=index)

# calculate NAAF-corrected percentage of C's with carb (should be 1.0)
totalpeaks80_NAAF['% C w/ carb.'] = totalpeaks80_NAAF['c-carb'] / totalpeaks80_NAAF['C'] 

# calculate NAAF-corrected percentage of M's that are oxidized
totalpeaks80_NAAF['% M w/ oxid'] = totalpeaks80_NAAF['m-oxid'] / totalpeaks80_NAAF['M'] 

# calculate NAAF-corrected percentage of K's that are oxidized
totalpeaks80_NAAF['% K w/ oxid'] = totalpeaks80_NAAF['k-oxid'] / totalpeaks80_NAAF['K'] 

# calculate NAAF-corrected percentage of P's that are oxidized
totalpeaks80_NAAF['% P w/ oxid'] = totalpeaks80_NAAF['p-oxid'] / totalpeaks80_NAAF['P'] 

# calculate NAAF-corrected percentage of R's that are oxidized
totalpeaks80_NAAF['% R w/ oxid'] = totalpeaks80_NAAF['r-oxid'] / totalpeaks80_NAAF['R'] 

# calculate NAAF-corrected percentage of Y's that are oxidized
totalpeaks80_NAAF['% Y w/ oxid'] = totalpeaks80_NAAF['y-oxid'] / totalpeaks80_NAAF['Y'] 

# calculate NAAF-corrected percentage of N's that are deamidated
totalpeaks80_NAAF['% N w/ deam'] = totalpeaks80_NAAF['n-deam'] / totalpeaks80_NAAF['N'] 

# calculate NAAF-corrected percentage of K's that are methylated
totalpeaks80_NAAF['% K w/ meth'] = totalpeaks80_NAAF['k-meth'] / totalpeaks80_NAAF['K'] 

# calculate NAAF-corrected percentage of R's that are methylated
totalpeaks80_NAAF['% R w/ meth'] = totalpeaks80_NAAF['r-meth'] / totalpeaks80_NAAF['R'] 

# calculate NAAF-corrected percentage of Q's that are pyro glu'd
totalpeaks80_NAAF['% Q w/ pyro'] = totalpeaks80_NAAF['q-pyro'] / totalpeaks80_NAAF['Q'] 

# calculate NAAF-corrected percentage of K's that are methylated
totalpeaks80_NAAF['% K w/ acet'] = totalpeaks80_NAAF['k-acet'] / totalpeaks80_NAAF['K'] 

# calculate NAAF summed numerator over denominator (in above cell) for all peptides in dataset i: a check
totalpeaks80_NAAF['NAAF check'] = totalpeaks80_NAAF['NAAF'] / 592839.766634

# write modified dataframe to new txt file, same name + totals
totalpeaks80_NAAF.to_csv("/home/millieginty/Documents/git-repos/rot-mayer/data/processed/PeaksDN/TW_330_T2_undigested_combine_PTMopt_DN80_NAAF_totals.csv")

totalpeaks80_NAAF.head()

Unnamed: 0,NAAF,A,C,D,E,F,G,H,I,K,...,% K w/ oxid,% P w/ oxid,% R w/ oxid,% Y w/ oxid,% N w/ deam,% K w/ meth,% R w/ meth,% Q w/ pyro,% K w/ acet,NAAF check
sample total,3939.393728,1986.026567,74.515359,1618.230008,3915.376662,928.431074,2186.64624,503.795621,0.0,2324.116949,...,0.132248,0.23497,0.070494,0.206837,0.372215,0.187196,0.169599,0.169958,0.149918,0.006645


## Export stripped peptides >80% ALC

In [10]:
# keep only stripped peptide column 
pep80 = peaks330[["stripped peptide"]]

# write altered dataframe to new txt file
# used header and index parameters to get rid of 'Peptide' header and the indexing

pep80.to_csv("/home/millieginty/Documents/git-repos/TW_330_T2_undigested_combine_PTMopt_DN80_stripped_peptides.txt", header=False, index=False)

# removing redundancy
peaks80dedup = pd.DataFrame.drop_duplicates(pep80)

# write altered dataframe to new txt file
# used header and index parameters to get rid of 'Peptide' header and the indexing

peaks80dedup.to_csv("/home/millieginty/Documents/git-repos/rot-mayer/data/processed/PeaksDN/TW_330_T2_undigested_combine_PTMopt_DN80_nonredundant_stripped_peptides.txt", header=False, index=False)

print("# redundant stripped Peaks peptides >80% ALC", len(pep80))
print("# nonredundant stripped Peaks peptides >80% ALC", len(peaks80dedup))
print("average peptide length Peaks peptides >80% ALC", peaks330['stripped length'].mean())

# count all unique peptide (modified peptides included)
# keep only peptide column >80% ALC
pep80m = peaks330[["Peptide"]]

# deduplicate
pep80mdedup = pd.DataFrame.drop_duplicates(pep80m)

print("# redundant Peaks peptides >80% ALC", len(pep80m))
print("# nonredundant Peaks peptides", len(pep80mdedup))

# check
pep80.head()

# redundant stripped Peaks peptides >80% ALC 3647
# nonredundant stripped Peaks peptides >80% ALC 3343
average peptide length Peaks peptides >80% ALC 7.813545379764189
# redundant Peaks peptides >80% ALC 3647
# nonredundant Peaks peptides 3433


Unnamed: 0,stripped peptide
0,EEVEKELEDR
1,KQELEDLTK
2,DEETKLSK
3,TGVFLKT
4,DEETKLSK
