### Manipulation of Trans Proteomic Pipeline (TPP) Comet database search results of *Prochlorococcus MED4* LC-MS/MS data using Python.

Starting with: 

- Comet output (.xlsx and .csv) of PTM-optimized database searches, sorted by XCorr (descending) and run through XInteract to extract precursor intensities and protein descriptions mapped from the search database.

Ending with:

- Files with stripped (no PTMs or tryptic ends) peptide lists and
- Columns with #'s of each modification in every sequence
- Column with stripped peptide lengths (# amino acids)
- Histogram of sequence lengths
- Bar plots of PTM occurrence

### To use:

#### 1. Change the input file name in *IN 4*
#### 2. Change output file name in *IN 6*, *IN 7*, *IN 8*

For technical duplicates, I exported the Comet search results as both Excel files and CSVs into my ETNP 2017 git repo.

Also, when running XInteract in the TPP, I combined the duplicate injections into a single PepXML file, which I exported as an xls file and converted to a csv.

In [1]:
cd /home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/RAL4_95_MED2_trypsin/TPP/

/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/RAL4_95_MED2_trypsin/TPP


In [2]:
ls

 RAL4_MED2_combine_Comet2.5Xcorr_proteins.txt
 RAL4_MED2_combine_Comet3_AA_NAAF.csv
 RAL4_MED2_combine_Comet3Xcorr_proteins.txt
 RAL4_MED2_trypsin_1_PTMopt_Comet.csv
 RAL4_MED2_trypsin_1_PTMopt_Comet_stripped.csv
 RAL4_MED2_trypsin_1_PTMopt_Comet_stripped_peptides_2.5XCorr.txt
 RAL4_MED2_trypsin_1_PTMopt_Comet_stripped_peptides.txt
 RAL4_MED2_trypsin_1_PTMopt_Comet_stripped_work.ods
 RAL4_MED2_trypsin_1_PTMopt_Comet_unfiltered.csv
 RAL4_MED2_trypsin_1_PTMopt_Comet.xlsx
 RAL4_MED2_trypsin_1_PTMopt_PepProp90.csv
 RAL4_MED2_trypsin_1_PTMopt_PepProp90_stripped.csv
 RAL4_MED2_trypsin_1_PTMopt_PepProp90_stripped_peptides
 RAL4_MED2_trypsin_1_PTMopt_PepProp90.xlsx
 RAL4_MED2_trypsin_2_PTMopt_Comet.csv
 RAL4_MED2_trypsin_2_PTMopt_Comet_stripped.csv
 RAL4_MED2_trypsin_2_PTMopt_Comet_stripped_peptides_2.5XCorr.txt
 RAL4_MED2_trypsin_2_PTMopt_Comet_stripped_peptides.txt
 RAL4_MED2_trypsin_2_PTMopt_Comet_stripped_work.ods
 RAL4_MED2_trypsin_2_PTMopt_Comet.xlsx
 RAL4_MED2_trypsi

In [3]:
# LIBRARIES
import os
os.getcwd()
# import the pandas library for working with tabular data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
#from matplotlib import pyplot
import seaborn as sns
from scipy.stats import kde
# import the regular expression (regex) module
import re
# check pandas version
pd.__version__

'1.0.5'

In [16]:
# formerly, read in the replicates without precursor intensities and protein descriptions:

# read the CSVs of each replicate into a dataframe we name 'comet' using the pandas read_csv function
##comet1 = pd.read_csv("/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/RAL4_95_MED2_trypsin/TPP/RAL4_MED2_trypsin_1_PTMopt_Comet.csv")
##comet2 = pd.read_csv("/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/RAL4_95_MED2_trypsin/TPP/RAL4_MED2_trypsin_2_PTMopt_Comet.csv")

##frames = [comet1, comet2]

# concatenate dataframes
## cometdup = pd.concat(frames, sort=False)

# now, reading in the combined csv that contains precursor intensities and protein descriptions
cometdup = pd.read_csv("/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/RAL4_95_MED2_trypsin/TPP/ral_95_med2_trypsin_combine_quant.pep.sort.csv", index_col='spectrum')

# remove redundant rows
comet = cometdup.drop_duplicates()

print("# redundant Comet peptides in combined dataframe", len(cometdup))
print("# nonredundant Comet peptides in combined dataframe", len(comet))

comet.head()

# redundant Comet peptides in combined dataframe 65535
# nonredundant Comet peptides in combined dataframe 65535


spectrum,xcorr,deltacn,expect,ions,peptide,protein,calc_neutral_pep_mass,precursor_intensity,protein_descr
022016_RAL4_95_MED2_trypsin_2.32518.32518.4,8.78,1.0,5.07e-12,12/204,K.LAIDDSSINLDQVDYIN[115.03]AHGTSTTANDKNETSAIK.S,PMM1609,3734.7759,5250000.0,| fabF | 3-oxoacyl-[acyl-carrier-protein] synt...
022016_RAL4_95_MED2_trypsin_1.41211.41211.4,8.768,1.0,1.42e-09,22/156,K.LFADENHLSPAVTAIQIEDIDAEQFRK.N,PMM0035,3069.5407,2970000.0,| DHSS | soluble hydrogenase small subunit
022016_RAL4_95_MED2_trypsin_2.50643.50643.4,8.599,0.583,2.14e-12,22/156,R.SGLQNAASIAGM[147.04]VLTTEC[160.03]IVADLPEKK.D,PMM1436,2831.4409,7140000.0,| groEL | chaperonin GroEL
022016_RAL4_95_MED2_trypsin_1.32793.32793.4,8.578,1.0,9.31e-10,18/204,K.LAIDDSSIN[115.03]LDQVDYIN[115.03]AHGTSTTANDK...,PMM1609,3735.7599,6550000.0,| fabF | 3-oxoacyl-[acyl-carrier-protein] synt...
022016_RAL4_95_MED2_trypsin_1.50751.50751.4,8.469,0.646,3.43e-14,24/156,R.SGLQNAASIAGM[147.04]VLTTEC[160.03]IVADLPEKK.D,PMM1436,2831.4409,8290000.0,| groEL | chaperonin GroEL
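
Both counts printed above are exactly 65535. A legacy .xls worksheet holds at most 65,536 rows, so a table that passed through an xls export (as the combined file did) tops out at 65,535 data rows below one header row. The output alone cannot confirm whether this file was cut off; the sketch below is only a quick check, and `may_be_xls_truncated` is a hypothetical helper name, not part of pandas or the TPP.

```python
XLS_MAX_ROWS = 65536  # row limit of a legacy .xls worksheet, header row included


def may_be_xls_truncated(n_data_rows, header_rows=1):
    """Return True when a table has exactly as many rows as one .xls sheet can hold."""
    return n_data_rows + header_rows == XLS_MAX_ROWS


# row count reported for the combined dataframe
print(may_be_xls_truncated(65535))
```

If the check returns True, comparing against the row count of the original PepXML export would settle whether any PSMs were lost.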


The peptide column includes the residues flanking the tryptic termini as well as the masses of modified residues (e.g., 160.03 Da for carbamidomethylated cysteine). We want to split that information into new columns and add a column with only the 'stripped' peptide sequence, which contains just amino acids; this we can then align against other sequences, for example.

Modified residues were allowed for:

- fixed carbamidomethylation of cysteine: 160.03 C
- variable oxidation of methionine: 147.04 M
- variable deamidation of asparagine: 115.03 N
- variable deamidation of glutamine: 129.04 Q
- variable iron cation on lysine: 182.11 K
- variable methylation of lysine: 142.11 K
- variable methylation of arginine: 170.12 R

We'll then write this manipulated dataframe to a new file.
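
The parsing described above can be sketched for a single annotated peptide string in plain Python. This is a minimal illustration, not the notebook's own cell: `parse_peptide` and `MOD_MASSES` are hypothetical names, the short keys mirror the column names used in the next cell, and the masses are the ones listed above.

```python
import re

# modification masses from the list above, keyed by the column names used later
MOD_MASSES = {
    "c-carb": "160.03",
    "m-oxid": "147.04",
    "n-deam": "115.03",
    "q-deam": "129.04",
    "k-iron": "182.11",
    "k-meth": "142.11",
    "r-meth": "170.12",
}


def parse_peptide(annotated):
    """Split a string like 'K.PEPT[115.03]IDE.S' into termini, stripped sequence and PTM counts."""
    # the first and last dots separate the flanking residues from the peptide itself;
    # dots inside the bracketed masses are left alone
    left, _, rest = annotated.strip().partition(".")
    core, _, right = rest.rpartition(".")
    # remove every bracketed mass, one bracket pair at a time
    stripped = re.sub(r"\[[^\]]*\]", "", core)
    row = {
        "L terminus": left,
        "R terminus": right,
        "stripped peptide": stripped,
        "stripped length": len(stripped),
    }
    for name, mass in MOD_MASSES.items():
        row[name] = core.count("[" + mass + "]")
    row["ptm-total"] = sum(row[name] for name in MOD_MASSES)
    return row


example = parse_peptide("K.LAIDDSSINLDQVDYIN[115.03]AHGTSTTANDKNETSAIK.S")
print(example["stripped peptide"], example["stripped length"], example["n-deam"])
```

Applied to the dataframe, something like `pd.DataFrame(comet['peptide'].map(parse_peptide).tolist(), index=comet.index)` should give one column per key. Counting residues on `stripped peptide` rather than on the full string would also leave the two flanking residues out of the amino acid counts.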

In [17]:
# get rid of rows where the xcorr is unavailable (usually 3 or so)
# .copy() gives an independent dataframe, so the column assignments below don't act on a slice
comet = comet[comet.xcorr != '[unavailable]'].copy()

# index with str[0] to add a column with the residue preceding the peptide's left terminus
comet['L terminus'] = comet['peptide'].astype(str).str[0]

# strip whitespace, then index with str[-1] to add a column with the residue following the peptide's right terminus
comet['R terminus'] = comet['peptide'].str.strip().str[-1]

# use a count function to enumerate the # of A's (alanines) in each peptide
comet['A'] = comet['peptide'].str.count("A")

# use a count function to enumerate the # of C's (cysteines) in each peptide
comet['C'] = comet['peptide'].str.count("C")

# use a count function to enumerate the # of D's (aspartic acids) in each peptide
comet['D'] = comet['peptide'].str.count("D")

# use a count function to enumerate the # of E's (glutamic acids) in each peptide
comet['E'] = comet['peptide'].str.count("E")

# use a count function to enumerate the # of F's (phenylalanines) in each peptide
comet['F'] = comet['peptide'].str.count("F")

# use a count function to enumerate the # of G's (glycines) in each peptide
comet['G'] = comet['peptide'].str.count("G")

# use a count function to enumerate the # of H's (histidines) in each peptide
comet['H'] = comet['peptide'].str.count("H")

# use a count function to enumerate the # of I's (isoleucines) in each peptide
comet['I'] = comet['peptide'].str.count("I")

# use a count function to enumerate the # of K's (lysines) in each peptide
comet['K'] = comet['peptide'].str.count("K")

# use a count function to enumerate the # of L's (leucines) in each peptide
comet['L'] = comet['peptide'].str.count("L")

# use a count function to enumerate the # of M's (methionines) in each peptide
comet['M'] = comet['peptide'].str.count("M")

# use a count function to enumerate the # of N's (asparagines) in each peptide
comet['N'] = comet['peptide'].str.count("N")

# use a count function to enumerate the # of P's (prolines) in each peptide
comet['P'] = comet['peptide'].str.count("P")

# use a count function to enumerate the # of Q's (glutamines) in each peptide
comet['Q'] = comet['peptide'].str.count("Q")

# use a count function to enumerate the # of R's (arginines) in each peptide
comet['R'] = comet['peptide'].str.count("R")

# use a count function to enumerate the # of S's (serines) in each peptide
comet['S'] = comet['peptide'].str.count("S")

# use a count function to enumerate the # of T's (threonines) in each peptide
comet['T'] = comet['peptide'].str.count("T")

# use a count function to enumerate the # of V's (valines) in each peptide
comet['V'] = comet['peptide'].str.count("V")

# use a count function to enumerate the # of W's (tryptophans) in each peptide
comet['W'] = comet['peptide'].str.count("W")

# use a count function to enumerate the # of Y's (tyrosines) in each peptide
comet['Y'] = comet['peptide'].str.count("Y")

# use a count function to enumerate the # of carbamidomethylated C's in each peptide
comet['c-carb'] = comet['peptide'].str.count("160.03")

# use a count function to enumerate the # of oxidized M's in each peptide
comet['m-oxid'] = comet['peptide'].str.count("147.04")

# use a count function to enumerate the # of deamidated N's in each peptide
comet['n-deam'] = comet['peptide'].str.count("115.03")

# use a count function to enumerate the # of deamidated Q's in each peptide
comet['q-deam'] = comet['peptide'].str.count("129.04")

# use a count function to enumerate the # of iron adducted K's in each peptide
comet['k-iron'] = comet['peptide'].str.count("182.11")

# use a count function to enumerate the # of methylated K's in each peptide
comet['k-meth'] = comet['peptide'].str.count("142.11")

# use a count function to enumerate the # of methylated R's in each peptide
comet['r-meth'] = comet['peptide'].str.count("170.12")

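# create a column with 'stripped' peptide sequences: slice off the flanking residues and dots,
# then remove the bracketed modification masses
# note: "\[.*\]" is greedy, so for peptides with two or more modifications it also removes
# the residues between the first "[" and the last "]"
comet['stripped peptide'] = comet['peptide'].str[2:].str[:-2].str.replace(r"\[.*\]", "", regex=True)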

# add a column with the stripped peptide length (number of AAs)
comet['stripped length'] = comet['stripped peptide'].apply(len)

# total the number of modifications in sequence
comet['ptm-total'] = comet['c-carb'] + comet['m-oxid'] + comet['n-deam'] + comet['q-deam'] + comet['k-iron'] + comet['k-meth'] + comet['r-meth']

# calculate the NAAF numerator for later NAAF normalization
comet['NAAF num.'] = comet['precursor_intensity'] / comet['stripped length']

# write the modified dataframe to a new csv file with the suffix 'unfiltered'
comet.to_csv("/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/RAL4_95_MED2_trypsin/TPP/RAL4_MED2_trypsin_1_PTMopt_Comet_unfiltered.csv")

# check out the results
comet.head()


spectrum,xcorr,deltacn,expect,ions,peptide,protein,calc_neutral_pep_mass,precursor_intensity,protein_descr,L terminus,...,m-oxid,n-deam,q-deam,k-iron,k-meth,r-meth,stripped peptide,stripped length,ptm-total,NAAF num.
022016_RAL4_95_MED2_trypsin_2.32518.32518.4,8.78,1.0,5.07e-12,12/204,K.LAIDDSSINLDQVDYIN[115.03]AHGTSTTANDKNETSAIK.S,PMM1609,3734.7759,5250000.0,| fabF | 3-oxoacyl-[acyl-carrier-protein] synt...,K,...,0,1,0,0,0,0,LAIDDSSINLDQVDYINAHGTSTTANDKNETSAIK,35,1,150000.0
022016_RAL4_95_MED2_trypsin_1.41211.41211.4,8.768,1.0,1.42e-09,22/156,K.LFADENHLSPAVTAIQIEDIDAEQFRK.N,PMM0035,3069.5407,2970000.0,| DHSS | soluble hydrogenase small subunit,K,...,0,0,0,0,0,0,LFADENHLSPAVTAIQIEDIDAEQFRK,27,0,110000.0
022016_RAL4_95_MED2_trypsin_2.50643.50643.4,8.599,0.583,2.14e-12,22/156,R.SGLQNAASIAGM[147.04]VLTTEC[160.03]IVADLPEKK.D,PMM1436,2831.4409,7140000.0,| groEL | chaperonin GroEL,R,...,1,0,0,0,0,0,SGLQNAASIAGMIVADLPEKK,21,2,340000.0
022016_RAL4_95_MED2_trypsin_1.32793.32793.4,8.578,1.0,9.31e-10,18/204,K.LAIDDSSIN[115.03]LDQVDYIN[115.03]AHGTSTTANDK...,PMM1609,3735.7599,6550000.0,| fabF | 3-oxoacyl-[acyl-carrier-protein] synt...,K,...,0,2,0,0,0,0,LAIDDSSINAHGTSTTANDKNETSAIK,27,2,242592.592593
022016_RAL4_95_MED2_trypsin_1.50751.50751.4,8.469,0.646,3.43e-14,24/156,R.SGLQNAASIAGM[147.04]VLTTEC[160.03]IVADLPEKK.D,PMM1436,2831.4409,8290000.0,| groEL | chaperonin GroEL,R,...,1,0,0,0,0,0,SGLQNAASIAGMIVADLPEKK,21,2,394761.904762
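
Two details of the output above are worth checking by hand. The third row has a stripped length of 21 because the pattern `\[.*\]` matches from the first `[` to the last `]`, which drops the residues between the two modifications. Separately, `str.count` runs on the full annotated string, so the flanking residues are included in the amino acid counts. The snippet below reproduces both effects on strings copied from the table; it is a sketch for illustration and the variable names are made up here.

```python
import re

annotated = "R.SGLQNAASIAGM[147.04]VLTTEC[160.03]IVADLPEKK.D"
core = annotated[2:-2]  # drop the flanking residues and dots

greedy = re.sub(r"\[.*\]", "", core)   # pattern used in the cell above
lazy = re.sub(r"\[.*?\]", "", core)    # non-greedy: each bracket pair is removed separately

print(greedy, len(greedy))  # SGLQNAASIAGMIVADLPEKK 21
print(lazy, len(lazy))      # SGLQNAASIAGMVLTTECIVADLPEKK 27

flanked = "K.LAIDDSSINLDQVDYIN[115.03]AHGTSTTANDKNETSAIK.S"
k_full = flanked.count("K")                                   # includes the flanking K
k_core = re.sub(r"\[.*?\]", "", flanked[2:-2]).count("K")     # peptide residues only
print(k_full, k_core)  # 3 2
```

In pandas the non-greedy version would be `.str.replace(r"\[.*?\]", "", regex=True)`. Since pandas 2.0 the default is `regex=False`, so passing `regex=True` explicitly matters on newer installations. Switching patterns changes `stripped length`, and with it every NAAF value derived from it.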


## Calculating the false discovery rate (% FDR)

### Filtering PSMs at or above a selected XCorr value and exporting peptides

In [18]:
# Let's separate out the decoy hits from the good ones

cometpmm = comet[~comet['protein'].str.contains("DECOY")]
cometdec = comet[comet['protein'].str.contains("DECOY")]

# how many PSMs match only PMM entries (real proteins in the database)?

print("# real Comet PSMs", len(cometpmm))

# compared to how many PSMs matching decoys?

print("# decoy Comet PSMs", len(cometdec))

# calculate the bulk FDR (all PSMs, so let's not beat ourselves up)
# note: this cell divides decoys by real PSMs (d/r); the filtered cells below use d/(d+r)

r = len(cometpmm)
d = len(cometdec)

FDR = d/r*100

print("False discovery rate = ", FDR)

# real Comet PSMs 54797
# decoy Comet PSMs 10738
False discovery rate =  19.59596328266146


In [19]:
# keep only PSMs with XCorr >= 2.5
# need to convert the xcorr column from strings to numeric so we can compare it with loc
comet['xcorr'] = pd.to_numeric(comet['xcorr'])

comet25 = comet.loc[comet['xcorr'] >= 2.5]

# What's the FDR?

# Let's separate out the decoy hits from the good ones

cometpmm25 = comet25[~comet25['protein'].str.contains("DECOY")]
cometdec25 = comet25[comet25['protein'].str.contains("DECOY")]

# how many PSMs match only PMM entries (real proteins in the database)?

print("# real Comet PSMs", len(cometpmm25))

# compared to how many PSMs matching decoys?

print("# decoy Comet PSMs", len(cometdec25))

# calculate the FDR 

r = len(cometpmm25)
d = len(cometdec25)

FDR = d/(d+r)*100

print("False discovery rate = ", FDR)

# real Comet PSMs 35931
# decoy Comet PSMs 1270
False discovery rate =  3.4138867234751755


In [8]:
# keep only PSMs with XCorr >= 3
# need to convert the xcorr column from strings to numeric so we can compare it with loc
comet['xcorr'] = pd.to_numeric(comet['xcorr'])

comet3 = comet.loc[comet['xcorr'] >= 3]

# What's the FDR?

# Let's separate out the decoy hits from the good ones

cometpmm3 = comet3[~comet3['protein'].str.contains("DECOY")]
cometdec3 = comet3[comet3['protein'].str.contains("DECOY")]

# how many PSMs match only PMM entries (real proteins in the database)?

print("# real Comet PSMs", len(cometpmm3))

# compared to how many PSMs matching decoys?

print("# decoy Comet PSMs", len(cometdec3))

# calculate the FDR 

r = len(cometpmm3)
d = len(cometdec3)

FDR = d/(d+r)*100

print("False discovery rate = ", FDR)

# real Comet PSMs 26923
# decoy Comet PSMs 258
False discovery rate =  0.9491924506088812


### Exporting peptides at the XCorr >= 2.5 and XCorr >= 3 thresholds:

In [20]:
# keep only PSMs with XCorr >= 2.5
# need to convert the xcorr column from strings to numeric so we can compare it with loc
comet['xcorr'] = pd.to_numeric(comet['xcorr'])

comet25 = comet.loc[comet['xcorr'] >= 2.5]

# Let's separate out the decoy hits from the good ones

cometpmm25 = comet25[~comet25['protein'].str.contains("DECOY")]
cometdec25 = comet25[comet25['protein'].str.contains("DECOY")]


# keep only peptide column 
pep25 = cometpmm25[["stripped peptide"]]

# write altered dataframe to new txt file
# used header and index parameters to get rid of 'Peptide' header and the indexing

pep25.to_csv("/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/RAL4_95_MED2_trypsin/TPP/RAL4_MED2_trypsin_combine_PTMopt_Comet_stripped_peptides_2.5XCorr.txt", header=False, index=False)

# removing redundancy
pep25dedup = pep25.drop_duplicates()

print("# redundant Comet peptides >2.5 XCorr", len(pep25))
print("# nonredundant Comet peptides >2.5 XCorr", len(pep25dedup))

pep25.head()

# redundant Comet peptides >2.5 XCorr 35931
# nonredundant Comet peptides >2.5 XCorr 12283


spectrum,stripped peptide
022016_RAL4_95_MED2_trypsin_2.32518.32518.4,LAIDDSSINLDQVDYINAHGTSTTANDKNETSAIK
022016_RAL4_95_MED2_trypsin_1.41211.41211.4,LFADENHLSPAVTAIQIEDIDAEQFRK
022016_RAL4_95_MED2_trypsin_2.50643.50643.4,SGLQNAASIAGMIVADLPEKK
022016_RAL4_95_MED2_trypsin_1.32793.32793.4,LAIDDSSINAHGTSTTANDKNETSAIK
022016_RAL4_95_MED2_trypsin_1.50751.50751.4,SGLQNAASIAGMIVADLPEKK


In [21]:
# keep only PSMs with XCorr >= 3
# need to convert the xcorr column from strings to numeric so we can compare it with loc
comet['xcorr'] = pd.to_numeric(comet['xcorr'])

comet3 = comet.loc[comet['xcorr'] >= 3]

# Let's separate out the decoy hits from the good ones

cometpmm3 = comet3[~comet3['protein'].str.contains("DECOY")]
cometdec3 = comet3[comet3['protein'].str.contains("DECOY")]

# export the whole table for Comet XCorr > 3
cometpmm3.to_csv("/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/RAL4_95_MED2_trypsin/TPP/RAL4_MED2_trypsin_combine_PTMopt_Comet_3XCorr_noDECOY.csv")

# keep only peptide column 
pep3 = cometpmm3[["stripped peptide"]]

# write altered dataframe to new txt file
# used header and index parameters to get rid of 'Peptide' header and the indexing

pep3.to_csv("/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/RAL4_95_MED2_trypsin/TPP/RAL4_MED2_trypsin_combine_PTMopt_Comet_stripped_peptides_3XCorr.txt", header=False, index=False)

# removing redundancy
pep3dedup = pep3.drop_duplicates()

print("# redundant Comet peptides >3 XCorr", len(pep3))
print("# nonredundant Comet peptides >3 XCorr", len(pep3dedup))

pep3.head()

# redundant Comet peptides >3 XCorr 26923
# nonredundant Comet peptides >3 XCorr 9213


spectrum,stripped peptide
022016_RAL4_95_MED2_trypsin_2.32518.32518.4,LAIDDSSINLDQVDYINAHGTSTTANDKNETSAIK
022016_RAL4_95_MED2_trypsin_1.41211.41211.4,LFADENHLSPAVTAIQIEDIDAEQFRK
022016_RAL4_95_MED2_trypsin_2.50643.50643.4,SGLQNAASIAGMIVADLPEKK
022016_RAL4_95_MED2_trypsin_1.32793.32793.4,LAIDDSSINAHGTSTTANDKNETSAIK
022016_RAL4_95_MED2_trypsin_1.50751.50751.4,SGLQNAASIAGMIVADLPEKK


## NAAF correction and exporting files with AA and PTM totals:

In [29]:
# made a new dataframe that contains the sums of certain columns in the stripped peptide dataframe above
# choosing the XCorr > 3 filtered results (no decoys)

index = ['sample total']

data = {'A': cometpmm3['A'].sum(),
        'C': cometpmm3['C'].sum(),
        'D': cometpmm3['D'].sum(),
        'E': cometpmm3['E'].sum(),
        'F': cometpmm3['F'].sum(),
        'G': cometpmm3['G'].sum(),
        'H': cometpmm3['H'].sum(),
        'I': cometpmm3['I'].sum(),
        'K': cometpmm3['K'].sum(),
        'L': cometpmm3['L'].sum(),
        'M': cometpmm3['M'].sum(),
        'N': cometpmm3['N'].sum(),
        'P': cometpmm3['P'].sum(),
        'Q': cometpmm3['Q'].sum(),
        'R': cometpmm3['R'].sum(),
        'S': cometpmm3['S'].sum(),
        'T': cometpmm3['T'].sum(),
        'V': cometpmm3['V'].sum(),
        'W': cometpmm3['W'].sum(),
        'Y': cometpmm3['Y'].sum(),
        'c-carb': cometpmm3['c-carb'].sum(),
        'm-oxid': cometpmm3['m-oxid'].sum(),
        'n-deam': cometpmm3['n-deam'].sum(),
        'q-deam': cometpmm3['q-deam'].sum(),
        'k-iron': cometpmm3['k-iron'].sum(),
        'k-meth': cometpmm3['k-meth'].sum(),
        'r-meth': cometpmm3['r-meth'].sum(),
        'Total area': cometpmm3['precursor_intensity'].sum(),
        'Total length': cometpmm3['stripped length'].sum()
       }

totalcometpmm3 = pd.DataFrame(data, columns=['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', \
                                           'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', \
                                           'W', 'Y', 'c-carb', 'm-oxid', 'n-deam', \
                                           'q-deam', 'k-iron', 'k-meth', 'r-meth', \
                                          'Total area', 'Total length'], index=index)

# calculate percentage of C's with carb (should be 1.0)
totalcometpmm3['% C w/ carb'] = totalcometpmm3['c-carb'] / totalcometpmm3['C'] 

# calculate percentage of M's that are oxidized
totalcometpmm3['% M w/ oxid'] = totalcometpmm3['m-oxid'] / totalcometpmm3['M'] 

# calculate percentage of N's that are deamidated
totalcometpmm3['% N w/ deam'] = totalcometpmm3['n-deam'] / totalcometpmm3['N'] 

# calculate percentage of Q's that are deamidated
totalcometpmm3['% Q w/ deam'] = totalcometpmm3['q-deam'] / totalcometpmm3['Q'] 

# calculate percentage of K's that carry an iron adduct
totalcometpmm3['% K w/ iron'] = totalcometpmm3['k-iron'] / totalcometpmm3['K'] 

# calculate percentage of K's that are methylated
totalcometpmm3['% K w/ meth'] = totalcometpmm3['k-meth'] / totalcometpmm3['K'] 

# calculate percentage of R's that are methylated
totalcometpmm3['% R w/ meth'] = totalcometpmm3['r-meth'] / totalcometpmm3['R'] 

# calculate NAAF denominator for all peptides in dataset i
totalcometpmm3['NAAF denom.'] = totalcometpmm3['Total area'] / totalcometpmm3['Total length']

# write modified dataframe to new txt file, same name + totals
totalcometpmm3.to_csv("/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/RAL4_95_MED2_trypsin/TPP/RAL95_MED2_trypsin_combine_cometpmm3_totals.csv")

totalcometpmm3.head()

Unnamed: 0,A,C,D,E,F,G,H,I,K,L,...,Total area,Total length,% C w/ carb,% M w/ oxid,% N w/ deam,% Q w/ deam,% K w/ iron,% K w/ meth,% R w/ meth,NAAF denom.
sample total,34390,3829,29556,36962,16265,34557,6414,35470,41629,49572,...,123693900000.0,412111,0.92818,0.292821,0.066992,0.067926,0.007423,0.003459,0.003377,300146.975419


In [30]:
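# use the NAAF denominator calculated above to calculate the NAAF factor
# NAAF: normalized area abundance factor

NAAF3 = 300146.975419

# divide each PSM's NAAF numerator by the XCorr >= 3 denominator
# (cometpmm3 is a filtered selection of comet3, so this assignment raises the
# SettingWithCopyWarning shown in the output; working on cometpmm3.copy() would avoid it)
cometpmm3['NAAF factor'] = (cometpmm3['NAAF num.'])/NAAF3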

# make a dataframe that contains only what we need: sequences, AAs, PTMs
cometpmm3_AA = cometpmm3[['stripped peptide', 'NAAF factor', 'A', 'C', 'D', 'E', 'F', 'G', 'H', 'K', 'I', 'L', \
                                'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y', 'c-carb', 'm-oxid', 'n-deam', \
                                'q-deam', 'k-iron', 'k-meth', 'r-meth']].copy()

# multiply the NAAF3 factor by the AAs to normalize its abundance by peak area and peptide length

cometpmm3_AA['A-NAAF3'] = cometpmm3_AA['A'] * cometpmm3['NAAF factor']
cometpmm3_AA['C-NAAF3'] = cometpmm3_AA['C'] * cometpmm3['NAAF factor']
cometpmm3_AA['D-NAAF3'] = cometpmm3_AA['D'] * cometpmm3['NAAF factor']
cometpmm3_AA['E-NAAF3'] = cometpmm3_AA['E'] * cometpmm3['NAAF factor']
cometpmm3_AA['F-NAAF3'] = cometpmm3_AA['F'] * cometpmm3['NAAF factor']
cometpmm3_AA['G-NAAF3'] = cometpmm3_AA['G'] * cometpmm3['NAAF factor']
cometpmm3_AA['H-NAAF3'] = cometpmm3_AA['H'] * cometpmm3['NAAF factor']
cometpmm3_AA['K-NAAF3'] = cometpmm3_AA['K'] * cometpmm3['NAAF factor']
cometpmm3_AA['I-NAAF3'] = cometpmm3_AA['I'] * cometpmm3['NAAF factor']
cometpmm3_AA['L-NAAF3'] = cometpmm3_AA['L'] * cometpmm3['NAAF factor']
cometpmm3_AA['M-NAAF3'] = cometpmm3_AA['M'] * cometpmm3['NAAF factor']
cometpmm3_AA['N-NAAF3'] = cometpmm3_AA['N'] * cometpmm3['NAAF factor']
cometpmm3_AA['P-NAAF3'] = cometpmm3_AA['P'] * cometpmm3['NAAF factor']
cometpmm3_AA['Q-NAAF3'] = cometpmm3_AA['Q'] * cometpmm3['NAAF factor']
cometpmm3_AA['R-NAAF3'] = cometpmm3_AA['R'] * cometpmm3['NAAF factor']
cometpmm3_AA['S-NAAF3'] = cometpmm3_AA['S'] * cometpmm3['NAAF factor']
cometpmm3_AA['T-NAAF3'] = cometpmm3_AA['T'] * cometpmm3['NAAF factor']
cometpmm3_AA['V-NAAF3'] = cometpmm3_AA['V'] * cometpmm3['NAAF factor']
cometpmm3_AA['W-NAAF3'] = cometpmm3_AA['W'] * cometpmm3['NAAF factor']
cometpmm3_AA['Y-NAAF3'] = cometpmm3_AA['Y'] * cometpmm3['NAAF factor']

# multiply the NAAF3 factor by the PTMs to normalize their abundance by peak area and peptide length

cometpmm3_AA['ccarb-NAAF3'] = cometpmm3_AA['c-carb'] * cometpmm3_AA['NAAF factor']
cometpmm3_AA['moxid-NAAF3'] = cometpmm3_AA['m-oxid'] * cometpmm3_AA['NAAF factor']
cometpmm3_AA['ndeam-NAAF3'] = cometpmm3_AA['n-deam'] * cometpmm3_AA['NAAF factor']
cometpmm3_AA['qdeam-NAAF3'] = cometpmm3_AA['q-deam'] * cometpmm3_AA['NAAF factor']
cometpmm3_AA['kiron-NAAF3'] = cometpmm3_AA['k-iron'] * cometpmm3_AA['NAAF factor']
cometpmm3_AA['kmeth-NAAF3'] = cometpmm3_AA['k-meth'] * cometpmm3_AA['NAAF factor']
cometpmm3_AA['rmeth-NAAF3'] = cometpmm3_AA['r-meth'] * cometpmm3_AA['NAAF factor']

# write the dataframe to a new csv
cometpmm3_AA.to_csv("/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/RAL4_95_MED2_trypsin/TPP/RAL4_MED2_combine_Comet3_AA_NAAF.csv")

cometpmm3_AA.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  cometpmm3['NAAF factor'] = (cometpmm3['NAAF num.'])/NAAF3


spectrum,stripped peptide,NAAF factor,A,C,D,E,F,G,H,K,...,V-NAAF3,W-NAAF3,Y-NAAF3,ccarb-NAAF3,moxid-NAAF3,ndeam-NAAF3,qdeam-NAAF3,kiron-NAAF3,kmeth-NAAF3,rmeth-NAAF3
022016_RAL4_95_MED2_trypsin_2.32518.32518.4,LAIDDSSINLDQVDYINAHGTSTTANDKNETSAIK,0.499755,4,0,5,1,0,1,1,3,...,0.499755,0.0,0.499755,0.0,0.0,0.499755,0.0,0.0,0.0,0.0
022016_RAL4_95_MED2_trypsin_1.41211.41211.4,LFADENHLSPAVTAIQIEDIDAEQFRK,0.366487,4,0,3,3,2,0,1,2,...,0.366487,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
022016_RAL4_95_MED2_trypsin_2.50643.50643.4,SGLQNAASIAGMIVADLPEKK,1.132778,4,1,2,2,0,2,0,2,...,2.265557,0.0,0.0,1.132778,1.132778,0.0,0.0,0.0,0.0,0.0
022016_RAL4_95_MED2_trypsin_1.32793.32793.4,LAIDDSSINAHGTSTTANDKNETSAIK,0.808246,4,0,5,1,0,1,1,3,...,0.808246,0.0,0.808246,0.0,0.0,1.616492,0.0,0.0,0.0,0.0
022016_RAL4_95_MED2_trypsin_1.50751.50751.4,SGLQNAASIAGMIVADLPEKK,1.315229,4,1,2,2,0,2,0,2,...,2.630457,0.0,0.0,1.315229,1.315229,0.0,0.0,0.0,0.0,0.0
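
The NAAF arithmetic used in the two cells above can be written out as three small functions: the denominator (summed area over summed length), the per-PSM factor, and the weighting of residue or PTM counts by that factor. This is a sketch of the calculation as this notebook performs it, with hypothetical function names, not a reference definition of NAAF.

```python
def naaf_denominator(intensities, lengths):
    """Summed precursor intensity divided by summed stripped length."""
    return sum(intensities) / sum(lengths)


def naaf_factor(intensity, length, denominator):
    """Per-PSM NAAF factor: (intensity / length) / denominator."""
    return (intensity / length) / denominator


def naaf_weighted(counts, factor):
    """Multiply each residue or PTM count of one PSM by its NAAF factor."""
    return {name: n * factor for name, n in counts.items()}


# first row of the table above: intensity 5250000.0, stripped length 35,
# and the XCorr >= 3 denominator computed earlier
factor = naaf_factor(5250000.0, 35, 300146.975419)
weighted = naaf_weighted({"V": 1, "W": 0, "n-deam": 1}, factor)
print(round(factor, 6), weighted)
```

Computing the denominator from the data with `naaf_denominator` instead of pasting it in as a constant would keep the factor in sync if the upstream filtering or the stripped lengths change.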


In [33]:
# made a dataframe that's the sum of NAAF corrected AAs and PTMs

index = ['sample total']

data = {'NAAF': cometpmm3_AA['NAAF factor'].sum(),
        'A': cometpmm3_AA['A-NAAF3'].sum(),
        'C': cometpmm3_AA['C-NAAF3'].sum(),
        'D': cometpmm3_AA['D-NAAF3'].sum(),
        'E': cometpmm3_AA['E-NAAF3'].sum(),
        'F': cometpmm3_AA['F-NAAF3'].sum(),
        'G': cometpmm3_AA['G-NAAF3'].sum(),
        'H': cometpmm3_AA['H-NAAF3'].sum(),
        'I': cometpmm3_AA['I-NAAF3'].sum(),
        'K': cometpmm3_AA['K-NAAF3'].sum(),
        'L': cometpmm3_AA['L-NAAF3'].sum(),
        'M': cometpmm3_AA['M-NAAF3'].sum(),
        'N': cometpmm3_AA['N-NAAF3'].sum(),
        'P': cometpmm3_AA['P-NAAF3'].sum(),
        'Q': cometpmm3_AA['Q-NAAF3'].sum(),
        'R': cometpmm3_AA['R-NAAF3'].sum(),
        'S': cometpmm3_AA['S-NAAF3'].sum(),
        'T': cometpmm3_AA['T-NAAF3'].sum(),
        'V': cometpmm3_AA['V-NAAF3'].sum(),
        'W': cometpmm3_AA['W-NAAF3'].sum(),
        'Y': cometpmm3_AA['Y-NAAF3'].sum(),
        'c-carb': cometpmm3_AA['ccarb-NAAF3'].sum(),
        'm-oxid': cometpmm3_AA['moxid-NAAF3'].sum(),
        'n-deam': cometpmm3_AA['ndeam-NAAF3'].sum(),
        'q-deam': cometpmm3_AA['qdeam-NAAF3'].sum(),
        'k-iron': cometpmm3_AA['kiron-NAAF3'].sum(),
        'k-meth': cometpmm3_AA['kmeth-NAAF3'].sum(),
        'r-meth': cometpmm3_AA['rmeth-NAAF3'].sum()
       }

totalcometpmm3_NAAF = pd.DataFrame(data, columns=['NAAF', 'A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', \
                                           'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', \
                                           'W', 'Y', 'c-carb', 'm-oxid', 'n-deam', \
                                           'q-deam', 'k-iron', 'k-meth', 'r-meth' \
                                          ], index=index)

# calculate NAAF-corrected percentage of C's with carb (should be 1.0)
totalcometpmm3_NAAF['% C w/ carb'] = totalcometpmm3_NAAF['c-carb'] / totalcometpmm3_NAAF['C'] 

# calculate NAAF-corrected percentage of M's that are oxidized
totalcometpmm3_NAAF['% M w/ oxid'] = totalcometpmm3_NAAF['m-oxid'] / totalcometpmm3_NAAF['M'] 

# calculate NAAF-corrected percentage of N's that are deamidated
totalcometpmm3_NAAF['% N w/ deam'] = totalcometpmm3_NAAF['n-deam'] / totalcometpmm3_NAAF['N'] 

# calculate NAAF-corrected percentage of Q's that are deamidated
totalcometpmm3_NAAF['% Q w/ deam'] = totalcometpmm3_NAAF['q-deam'] / totalcometpmm3_NAAF['Q'] 

# calculate NAAF-corrected percentage of K's that carry an iron adduct
totalcometpmm3_NAAF['% K w/ iron'] = totalcometpmm3_NAAF['k-iron'] / totalcometpmm3_NAAF['K'] 

# calculate NAAF-corrected percentage of K's that are methylated
totalcometpmm3_NAAF['% K w/ meth'] = totalcometpmm3_NAAF['k-meth'] / totalcometpmm3_NAAF['K'] 

# calculate NAAF-corrected percentage of R's that are methylated
totalcometpmm3_NAAF['% R w/ meth'] = totalcometpmm3_NAAF['r-meth'] / totalcometpmm3_NAAF['R'] 

# as a check, divide the summed NAAF factors by the NAAF denominator from the cell above
totalcometpmm3_NAAF['NAAF check'] = totalcometpmm3_NAAF['NAAF'] / 300146.975419

# write modified dataframe to new txt file, same name + totals
totalcometpmm3_NAAF.to_csv("/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/RAL4_95_MED2_trypsin/TPP/RAL95_MED2_trypsin_combine_Comet3_NAAF_totals.csv")

totalcometpmm3_NAAF.head()

Unnamed: 0,NAAF,A,C,D,E,F,G,H,I,K,...,k-meth,r-meth,% C w/ carb,% M w/ oxid,% N w/ deam,% Q w/ deam,% K w/ iron,% K w/ meth,% R w/ meth,NAAF check
sample total,31802.071269,43717.228046,2475.403799,28388.531327,38901.639753,13339.36148,38781.399628,4155.241209,32998.556992,48957.859374,...,296.749839,154.226763,0.893733,0.196407,0.055786,0.037186,0.008633,0.006061,0.006967,0.105955
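
The seven ratio columns all follow the same rule: the summed (NAAF-weighted) PTM count divided by the summed count of the residue that carries it. A lookup table makes that pairing explicit. The sketch below uses hypothetical names (`PTM_RESIDUE`, `modified_fractions`) and checks itself against the K and k-meth totals shown above; note that the results are fractions between 0 and 1, even though the column names start with "%".

```python
# which residue each PTM column is normalized against
PTM_RESIDUE = {
    "c-carb": "C",
    "m-oxid": "M",
    "n-deam": "N",
    "q-deam": "Q",
    "k-iron": "K",
    "k-meth": "K",
    "r-meth": "R",
}


def modified_fractions(totals):
    """Return PTM total / residue total for every pair present in `totals`."""
    fractions = {}
    for ptm, residue in PTM_RESIDUE.items():
        # skip pairs with a missing or zero residue total
        if ptm in totals and totals.get(residue):
            fractions[ptm] = totals[ptm] / totals[residue]
    return fractions


# NAAF-weighted totals for K and k-meth taken from the table above
fractions = modified_fractions({"K": 48957.859374, "k-meth": 296.749839})
print(fractions)
```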


### Same thing but for Comet > XCorr 2.5

In [26]:
# made a new dataframe that contains the sums of certain columns in the stripped peptide dataframe above
# choosing the XCorr > 2.5 filtered results (no decoys)

index = ['sample total']

data = {'A': cometpmm25['A'].sum(),
        'C': cometpmm25['C'].sum(),
        'D': cometpmm25['D'].sum(),
        'E': cometpmm25['E'].sum(),
        'F': cometpmm25['F'].sum(),
        'G': cometpmm25['G'].sum(),
        'H': cometpmm25['H'].sum(),
        'I': cometpmm25['I'].sum(),
        'K': cometpmm25['K'].sum(),
        'L': cometpmm25['L'].sum(),
        'M': cometpmm25['M'].sum(),
        'N': cometpmm25['N'].sum(),
        'P': cometpmm25['P'].sum(),
        'Q': cometpmm25['Q'].sum(),
        'R': cometpmm25['R'].sum(),
        'S': cometpmm25['S'].sum(),
        'T': cometpmm25['T'].sum(),
        'V': cometpmm25['V'].sum(),
        'W': cometpmm25['W'].sum(),
        'Y': cometpmm25['Y'].sum(),
        'c-carb': cometpmm25['c-carb'].sum(),
        'm-oxid': cometpmm25['m-oxid'].sum(),
        'n-deam': cometpmm25['n-deam'].sum(),
        'q-deam': cometpmm25['q-deam'].sum(),
        'k-iron': cometpmm25['k-iron'].sum(),
        'k-meth': cometpmm25['k-meth'].sum(),
        'r-meth': cometpmm25['r-meth'].sum(),
        'Total area': cometpmm25['precursor_intensity'].sum(),
        'Total length': cometpmm25['stripped length'].sum()
       }

totalcometpmm25 = pd.DataFrame(data, columns=['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', \
                                           'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', \
                                           'W', 'Y', 'c-carb', 'm-oxid', 'n-deam', \
                                           'q-deam', 'k-iron', 'k-meth', 'r-meth', \
                                          'Total area', 'Total length'], index=index)

# calculate percentage of C's with carb (should be 1.0)
totalcometpmm25['% C w/ carb'] = totalcometpmm25['c-carb'] / totalcometpmm25['C'] 

# calculate percentage of M's that are oxidized
totalcometpmm25['% M w/ oxid'] = totalcometpmm25['m-oxid'] / totalcometpmm25['M'] 

# calculate percentage of N's that are deamidated
totalcometpmm25['% N w/ deam'] = totalcometpmm25['n-deam'] / totalcometpmm25['N'] 

# calculate percentage of Q's that are deamidated
totalcometpmm25['% Q w/ deam'] = totalcometpmm25['q-deam'] / totalcometpmm25['Q'] 

# calculate percentage of K's that carry the iron modification (k-iron)
totalcometpmm25['% K w/ iron'] = totalcometpmm25['k-iron'] / totalcometpmm25['K'] 

# calculate percentage of K's that are methylated
totalcometpmm25['% K w/ meth'] = totalcometpmm25['k-meth'] / totalcometpmm25['K'] 

# calculate percentage of R's that are methylated
totalcometpmm25['% R w/ meth'] = totalcometpmm25['r-meth'] / totalcometpmm25['R'] 

# calculate NAAF denominator for all peptides in dataset i
totalcometpmm25['NAAF denom.'] = totalcometpmm25['Total area'] / totalcometpmm25['Total length']

# write the totals dataframe to a new csv file
totalcometpmm25.to_csv("/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/RAL4_95_MED2_trypsin/TPP/RAL95_MED2_trypsin_combine_cometpmm25_totals.csv")

totalcometpmm25.head()

Unnamed: 0,A,C,D,E,F,G,H,I,K,L,...,Total area,Total length,% C w/ carb,% M w/ oxid,% N w/ deam,% Q w/ deam,% K w/ iron,% K w/ meth,% R w/ meth,NAAF denom.
sample total,43387,5043,38019,47552,21887,44168,7846,46572,55616,64440,...,150226100000.0,527301,0.927622,0.304908,0.084605,0.08643,0.018646,0.010447,0.009026,284896.295991
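
The per-column sums and PTM fractions above can also be built with a loop instead of one line per column. A minimal sketch, assuming a toy dataframe `toy` (hypothetical values) that follows the same column naming as `cometpmm25`; the `ptm_to_residue` mapping and the `frac ...` column names are illustrative and not part of the original notebook:

```python
import pandas as pd

# toy stand-in for cometpmm25 (hypothetical values): residue counts, PTM counts,
# precursor intensity, and stripped length for three peptides
toy = pd.DataFrame({
    'C': [1, 0, 2], 'M': [2, 1, 1],
    'c-carb': [1, 0, 2], 'm-oxid': [1, 0, 1],
    'precursor_intensity': [1000.0, 500.0, 1500.0],
    'stripped length': [10, 5, 15],
})

# which residue column each PTM column is divided by
ptm_to_residue = {'c-carb': 'C', 'm-oxid': 'M'}

# one-row dataframe of column sums, indexed like the notebook's totals table
totals = toy.sum().to_frame().T
totals.index = ['sample total']

# fraction of each residue carrying the PTM (modified count / residue count)
for ptm, residue in ptm_to_residue.items():
    totals[f'frac {residue} w/ {ptm}'] = totals[ptm] / totals[residue]

# NAAF denominator: total peak area divided by total stripped length
totals['NAAF denom.'] = totals['precursor_intensity'] / totals['stripped length']

print(totals.T)
```

Extending the mapping to the other PTM columns (`n-deam`, `q-deam`, `k-iron`, `k-meth`, `r-meth`) would cover the same ratios as the cell above.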


In [27]:
# use the calculated NAAF denominator (from above) to calculate the NAAF factor
# NAAF: normalized area abundance factor

NAAF25 = 284896.295991

# divide each peptide's NAAF numerator by the XCorr > 2.5 denominator to get its NAAF factor
cometpmm25['NAAF factor'] = (cometpmm25['NAAF num.'])/NAAF25

# make a dataframe that contains only what we need: sequences, AAs, PTMs
cometpmm25_AA = cometpmm25[['stripped peptide', 'NAAF factor', 'A', 'C', 'D', 'E', 'F', 'G', 'H', 'K', 'I', 'L', \
                                'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y', 'c-carb', 'm-oxid', 'n-deam', \
                                'q-deam', 'k-iron', 'k-meth', 'r-meth']].copy()

# multiply each AA count by the NAAF factor to normalize its abundance by peak area and peptide length

cometpmm25_AA['A-NAAF25'] = cometpmm25_AA['A'] * cometpmm25['NAAF factor']
cometpmm25_AA['C-NAAF25'] = cometpmm25_AA['C'] * cometpmm25['NAAF factor']
cometpmm25_AA['D-NAAF25'] = cometpmm25_AA['D'] * cometpmm25['NAAF factor']
cometpmm25_AA['E-NAAF25'] = cometpmm25_AA['E'] * cometpmm25['NAAF factor']
cometpmm25_AA['F-NAAF25'] = cometpmm25_AA['F'] * cometpmm25['NAAF factor']
cometpmm25_AA['G-NAAF25'] = cometpmm25_AA['G'] * cometpmm25['NAAF factor']
cometpmm25_AA['H-NAAF25'] = cometpmm25_AA['H'] * cometpmm25['NAAF factor']
cometpmm25_AA['K-NAAF25'] = cometpmm25_AA['K'] * cometpmm25['NAAF factor']
cometpmm25_AA['I-NAAF25'] = cometpmm25_AA['I'] * cometpmm25['NAAF factor']
cometpmm25_AA['L-NAAF25'] = cometpmm25_AA['L'] * cometpmm25['NAAF factor']
cometpmm25_AA['M-NAAF25'] = cometpmm25_AA['M'] * cometpmm25['NAAF factor']
cometpmm25_AA['N-NAAF25'] = cometpmm25_AA['N'] * cometpmm25['NAAF factor']
cometpmm25_AA['P-NAAF25'] = cometpmm25_AA['P'] * cometpmm25['NAAF factor']
cometpmm25_AA['Q-NAAF25'] = cometpmm25_AA['Q'] * cometpmm25['NAAF factor']
cometpmm25_AA['R-NAAF25'] = cometpmm25_AA['R'] * cometpmm25['NAAF factor']
cometpmm25_AA['S-NAAF25'] = cometpmm25_AA['S'] * cometpmm25['NAAF factor']
cometpmm25_AA['T-NAAF25'] = cometpmm25_AA['T'] * cometpmm25['NAAF factor']
cometpmm25_AA['V-NAAF25'] = cometpmm25_AA['V'] * cometpmm25['NAAF factor']
cometpmm25_AA['W-NAAF25'] = cometpmm25_AA['W'] * cometpmm25['NAAF factor']
cometpmm25_AA['Y-NAAF25'] = cometpmm25_AA['Y'] * cometpmm25['NAAF factor']

# multiply each PTM count by the NAAF factor to normalize its abundance by peak area and peptide length

cometpmm25_AA['ccarb-NAAF25'] = cometpmm25_AA['c-carb'] * cometpmm25_AA['NAAF factor']
cometpmm25_AA['moxid-NAAF25'] = cometpmm25_AA['m-oxid'] * cometpmm25_AA['NAAF factor']
cometpmm25_AA['ndeam-NAAF25'] = cometpmm25_AA['n-deam'] * cometpmm25_AA['NAAF factor']
cometpmm25_AA['qdeam-NAAF25'] = cometpmm25_AA['q-deam'] * cometpmm25_AA['NAAF factor']
cometpmm25_AA['kiron-NAAF25'] = cometpmm25_AA['k-iron'] * cometpmm25_AA['NAAF factor']
cometpmm25_AA['kmeth-NAAF25'] = cometpmm25_AA['k-meth'] * cometpmm25_AA['NAAF factor']
cometpmm25_AA['rmeth-NAAF25'] = cometpmm25_AA['r-meth'] * cometpmm25_AA['NAAF factor']

# write the dataframe to a new csv
cometpmm25_AA.to_csv("/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/RAL4_95_MED2_trypsin/TPP/RAL4_MED2_combine_Comet25_AA_NAAF.csv")

cometpmm25_AA.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  cometpmm25['NAAF factor'] = (cometpmm25['NAAF num.'])/NAAF25
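
The warning above is what pandas emits when a column is assigned on a dataframe that was itself derived from another dataframe. A minimal sketch of one common way to avoid it, assuming the filtered frame is produced by a boolean mask (the names `df` and `filtered` are hypothetical, and the notebook's actual filtering step is not shown in this section): take an explicit `.copy()` at the point of filtering.

```python
import pandas as pd

# hypothetical parent dataframe with an XCorr column and a NAAF numerator
df = pd.DataFrame({'xcorr': [1.0, 2.7, 3.4], 'NAAF num.': [10.0, 20.0, 30.0]})

# explicit copy: `filtered` owns its data, so adding a column is unambiguous
filtered = df[df['xcorr'] > 2.5].copy()
filtered['NAAF factor'] = filtered['NAAF num.'] / 10.0

# the parent dataframe is left untouched
print(filtered)
print('NAAF factor' in df.columns)
```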


spectrum,stripped peptide,NAAF factor,A,C,D,E,F,G,H,K,...,V-NAAF25,W-NAAF25,Y-NAAF25,ccarb-NAAF25,moxid-NAAF25,ndeam-NAAF25,qdeam-NAAF25,kiron-NAAF25,kmeth-NAAF25,rmeth-NAAF25
022016_RAL4_95_MED2_trypsin_2.32518.32518.4,LAIDDSSINLDQVDYINAHGTSTTANDKNETSAIK,0.526507,4,0,5,1,0,1,1,3,...,0.526507,0.0,0.526507,0.0,0.0,0.526507,0.0,0.0,0.0,0.0
022016_RAL4_95_MED2_trypsin_1.41211.41211.4,LFADENHLSPAVTAIQIEDIDAEQFRK,0.386105,4,0,3,3,2,0,1,2,...,0.386105,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
022016_RAL4_95_MED2_trypsin_2.50643.50643.4,SGLQNAASIAGMIVADLPEKK,1.193417,4,1,2,2,0,2,0,2,...,2.386833,0.0,0.0,1.193417,1.193417,0.0,0.0,0.0,0.0,0.0
022016_RAL4_95_MED2_trypsin_1.32793.32793.4,LAIDDSSINAHGTSTTANDKNETSAIK,0.851512,4,0,5,1,0,1,1,3,...,0.851512,0.0,0.851512,0.0,0.0,1.703024,0.0,0.0,0.0,0.0
022016_RAL4_95_MED2_trypsin_1.50751.50751.4,SGLQNAASIAGMIVADLPEKK,1.385634,4,1,2,2,0,2,0,2,...,2.771267,0.0,0.0,1.385634,1.385634,0.0,0.0,0.0,0.0,0.0
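
To make the normalization concrete, here is a minimal sketch on toy numbers. It assumes that `NAAF num.` is each peptide's precursor intensity divided by its stripped length; that column is computed earlier in the notebook, so treat this definition as an assumption. The dataframe `pep` and its values are hypothetical.

```python
import pandas as pd

# three hypothetical peptides with peak areas and N / deamidated-N counts
pep = pd.DataFrame({
    'stripped peptide': ['ACK', 'MNNK', 'AAQR'],
    'precursor_intensity': [300.0, 800.0, 400.0],
    'N': [0, 2, 0],
    'n-deam': [0, 1, 0],
})
pep['stripped length'] = pep['stripped peptide'].str.len()

# assumed per-peptide numerator: peak area per residue
pep['NAAF num.'] = pep['precursor_intensity'] / pep['stripped length']

# dataset-wide denominator: total peak area / total stripped length
naaf_denom = pep['precursor_intensity'].sum() / pep['stripped length'].sum()

# NAAF factor: numerator scaled by the dataset-wide denominator
pep['NAAF factor'] = pep['NAAF num.'] / naaf_denom

# weight residue and PTM counts by the NAAF factor
pep['N-NAAF'] = pep['N'] * pep['NAAF factor']
pep['ndeam-NAAF'] = pep['n-deam'] * pep['NAAF factor']

print(pep[['stripped peptide', 'NAAF factor', 'N-NAAF', 'ndeam-NAAF']])
```

Computing the denominator from the dataframe, rather than pasting the printed value into the next cell, is one way to keep the factor in step with the data if the input file changes.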


In [32]:
# made a dataframe that's the sum of NAAF corrected AAs and PTMs

index = ['sample total']

data = {'NAAF': cometpmm25_AA['NAAF factor'].sum(),
        'A': cometpmm25_AA['A-NAAF25'].sum(),
        'C': cometpmm25_AA['C-NAAF25'].sum(),
        'D': cometpmm25_AA['D-NAAF25'].sum(),
        'E': cometpmm25_AA['E-NAAF25'].sum(),
        'F': cometpmm25_AA['F-NAAF25'].sum(),
        'G': cometpmm25_AA['G-NAAF25'].sum(),
        'H': cometpmm25_AA['H-NAAF25'].sum(),
        'I': cometpmm25_AA['I-NAAF25'].sum(),
        'K': cometpmm25_AA['K-NAAF25'].sum(),
        'L': cometpmm25_AA['L-NAAF25'].sum(),
        'M': cometpmm25_AA['M-NAAF25'].sum(),
        'N': cometpmm25_AA['N-NAAF25'].sum(),
        'P': cometpmm25_AA['P-NAAF25'].sum(),
        'Q': cometpmm25_AA['Q-NAAF25'].sum(),
        'R': cometpmm25_AA['R-NAAF25'].sum(),
        'S': cometpmm25_AA['S-NAAF25'].sum(),
        'T': cometpmm25_AA['T-NAAF25'].sum(),
        'V': cometpmm25_AA['V-NAAF25'].sum(),
        'W': cometpmm25_AA['W-NAAF25'].sum(),
        'Y': cometpmm25_AA['Y-NAAF25'].sum(),
        'c-carb': cometpmm25_AA['ccarb-NAAF25'].sum(),
        'm-oxid': cometpmm25_AA['moxid-NAAF25'].sum(),
        'n-deam': cometpmm25_AA['ndeam-NAAF25'].sum(),
        'q-deam': cometpmm25_AA['qdeam-NAAF25'].sum(),
        'k-iron': cometpmm25_AA['kiron-NAAF25'].sum(),
        'k-meth': cometpmm25_AA['kmeth-NAAF25'].sum(),
        'r-meth': cometpmm25_AA['rmeth-NAAF25'].sum()
       }

totalcometpmm25_NAAF = pd.DataFrame(data, columns=['NAAF', 'A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', \
                                           'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', \
                                           'W', 'Y', 'c-carb', 'm-oxid', 'n-deam', \
                                           'q-deam', 'k-iron', 'k-meth', 'r-meth' \
                                          ], index=index)

# calculate NAAF-corrected percentage of C's with carb (should be 1.0)
totalcometpmm25_NAAF['% C w/ carb'] = totalcometpmm25_NAAF['c-carb'] / totalcometpmm25_NAAF['C'] 

# calculate NAAF-corrected percentage of M's that are oxidized
totalcometpmm25_NAAF['% M w/ oxid'] = totalcometpmm25_NAAF['m-oxid'] / totalcometpmm25_NAAF['M'] 

# calculate NAAF-corrected percentage of N's that are deamidated
totalcometpmm25_NAAF['% N w/ deam'] = totalcometpmm25_NAAF['n-deam'] / totalcometpmm25_NAAF['N'] 

# calculate NAAF-corrected percentage of Q's that are deamidated
totalcometpmm25_NAAF['% Q w/ deam'] = totalcometpmm25_NAAF['q-deam'] / totalcometpmm25_NAAF['Q'] 

# calculate NAAF-corrected percentage of K's that carry the iron modification (k-iron)
totalcometpmm25_NAAF['% K w/ iron'] = totalcometpmm25_NAAF['k-iron'] / totalcometpmm25_NAAF['K'] 

# calculate NAAF-corrected percentage of K's that are methylated
totalcometpmm25_NAAF['% K w/ meth'] = totalcometpmm25_NAAF['k-meth'] / totalcometpmm25_NAAF['K'] 

# calculate NAAF-corrected percentage of R's that are methylated
totalcometpmm25_NAAF['% R w/ meth'] = totalcometpmm25_NAAF['r-meth'] / totalcometpmm25_NAAF['R'] 

# calculate NAAF summed numerator over denominator (in above cell) for all peptides in dataset i: a check
totalcometpmm25_NAAF['NAAF check'] = totalcometpmm25_NAAF['NAAF'] / 284896.295991

# write the NAAF totals dataframe to a new csv file
totalcometpmm25_NAAF.to_csv("/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/RAL4_95_MED2_trypsin/TPP/RAL95_MED2_trypsin_combine_Comet25_NAAF_totals.csv")

totalcometpmm25_NAAF.head()

Unnamed: 0,NAAF,A,C,D,E,F,G,H,I,K,...,k-meth,r-meth,% C w/ carb,% M w/ oxid,% N w/ deam,% Q w/ deam,% K w/ iron,% K w/ meth,% R w/ meth,NAAF check
sample total,45490.83749,55935.7066,4646.427904,36368.608101,53458.408691,20473.760802,49778.157307,6800.166785,49325.893821,71764.656245,...,2110.530996,2046.199542,0.928694,0.196043,0.142628,0.06945,0.06867,0.029409,0.061941,0.159675
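
Written out, the NAAF-corrected fraction computed above for a modification $m$ on residue $r$ is a ratio of NAAF-weighted counts. The symbols are introduced here for illustration: $n_{m,i}$ and $n_{r,i}$ are the counts of the modification and of the residue in peptide $i$, and $w_i$ is that peptide's `NAAF factor`.

$$
f_m = \frac{\sum_i n_{m,i}\, w_i}{\sum_i n_{r,i}\, w_i}
$$

Because every $w_i$ is the peptide's `NAAF num.` divided by the same dataset-wide denominator, that denominator cancels in the ratio, so $f_m$ depends only on the per-peptide numerators.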


### All XCorr - Visualizing the results:

In [None]:
# making evenly spaced bins for the Xcorr data based on the min and max, called above
bins = [0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5, 8, 8.5, 9]
labels = ['0-0.5', '0.5-1', '1-1.5', '1.5-2', '2-2.5', '2.5-3', '3-3.5', '3.5-4', '4-4.5', '4.5-5', '5-5.5', '5.5-6', '6-6.5', '6.5-7', '7-7.5', '7.5-8', '8-8.5', '8.5-9']

# use pandas cut function to do the binning itself
comet['binned'] = pd.cut(comet['xcorr'], bins=bins, labels=labels)

# bar plots of binned PTM data

# map each plot column name to its PTM count column in the comet dataframe
ptm_cols = {'Total PTMs': 'ptm-total', 'Cys carb.': 'c-carb', 'Met oxi.': 'm-oxid',
            'Asp deam.': 'n-deam', 'Glut deam.': 'q-deam', 'Lys iron': 'k-iron',
            'Lys meth.': 'k-meth', 'Arg meth.': 'r-meth'}

# sum every PTM column within each XCorr bin in a single groupby;
# observed=False keeps bins that contain no spectra (their sum is 0)
cometbin = comet.groupby('binned', observed=False)[list(ptm_cols.values())].sum()

# rename the columns to the plot names and use plain bin labels as the index
cometbin.columns = list(ptm_cols)
cometbin.index = list(cometbin.index)

ax1 = cometbin.plot.bar(y='Total PTMs', rot=45)
ax1.set_title('Total PTMs')

ax2 = cometbin.plot.bar(y='Cys carb.', rot=45)
ax2.set_title('Cysteine carbamidomethylation')

ax3 = cometbin.plot.bar(y='Met oxi.', rot=45)
ax3.set_title('Methionine oxidation')

ax4 = cometbin.plot.bar(y='Asp deam.', rot=45)
ax4.set_title('Asparagine deamidation')

ax5 = cometbin.plot.bar(y='Glut deam.', rot=45)
ax5.set_title('Glutamine deamidation')

ax6 = cometbin.plot.bar(y='Lys iron', rot=45)
ax6.set_title('Lysine iron adduct')

ax7 = cometbin.plot.bar(y='Lys meth.', rot=45)
ax7.set_title('Lysine methylation')

ax8 = cometbin.plot.bar(y='Arg meth.', rot=45)
ax8.set_title('Arginine methylation')

#plt.savefig('/home/millieginty/Documents/git-repos/2017-etnp/analyses/pronovo-2020/pronovo-2020-ptm/MED4_trypsin1_Comet_PTMopt.png')
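
The binning step is easier to check on a small example. A minimal sketch, assuming a toy dataframe `toy` with hypothetical values and coarser bins than the notebook uses; it shows that `pd.cut` intervals are right-closed by default and that the grouped sum keeps empty bins when `observed=False`:

```python
import pandas as pd

# hypothetical XCorr scores and PTM counts for five spectra
toy = pd.DataFrame({'xcorr': [0.4, 2.5, 2.6, 3.0, 3.2],
                    'ptm-total': [1, 2, 0, 3, 1]})

bins = [0, 2.5, 3, 3.5]
labels = ['0-2.5', '2.5-3', '3-3.5']

# pd.cut intervals are right-closed by default: (0, 2.5], (2.5, 3], (3, 3.5]
toy['binned'] = pd.cut(toy['xcorr'], bins=bins, labels=labels)

# sum PTM counts per bin; observed=False keeps bins with no spectra
per_bin = toy.groupby('binned', observed=False)['ptm-total'].sum()
print(per_bin)
```

Note that a score equal to the lowest bin edge falls outside the first interval unless `include_lowest=True` is passed to `pd.cut`.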

In [None]:
# histogram of stripped peptide lengths

#plt.rcdefaults()
#fig, ax = plt.subplots()

comet.plot(y='stripped length', kind = 'hist', bins = 20, title = 'Peptide length')
plt.show()

In [None]:
# histogram of the total number of PTMs per sequence

#plt.rcdefaults()
#fig, ax = plt.subplots()

comet.plot(y='ptm-total', kind = 'hist', bins = 20, title = 'PTMs/sequence')
plt.show()

In [None]:
from matplotlib import pyplot
# density plot of xcorr vs length (idea from https://python-graph-gallery.com/85-density-plot-with-matplotlib/)

# read in data
x = comet['xcorr']
y = comet['stripped length']
 
# evaluate a gaussian kernel density estimation (KDE) on a regular grid of nbins x nbins over data extents
nbins=300
k = kde.gaussian_kde([x,y])
xi, yi = np.mgrid[x.min():x.max():nbins*1j, y.min():y.max():nbins*1j]
zi = k(np.vstack([xi.flatten(), yi.flatten()]))
 
# make the plot
plt.pcolormesh(xi, yi, zi.reshape(xi.shape))
plt.colorbar()
plt.show() 
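
The grid evaluation in the cell above can be tried without the Comet data. A minimal sketch, assuming synthetic stand-ins for XCorr and stripped length (the random values and the smaller `nbins` are illustrative only) and importing `gaussian_kde` directly from `scipy.stats`:

```python
import numpy as np
from scipy.stats import gaussian_kde

# synthetic stand-ins for the two variables (hypothetical values)
rng = np.random.default_rng(0)
x = rng.normal(3.0, 1.0, 200)    # stand-in for XCorr
y = rng.normal(15.0, 4.0, 200)   # stand-in for stripped length

# fit a 2D Gaussian KDE; the input has one row per dimension
nbins = 50
k = gaussian_kde(np.vstack([x, y]))

# regular nbins x nbins grid spanning the data extents
xi, yi = np.mgrid[x.min():x.max():nbins*1j, y.min():y.max():nbins*1j]

# evaluate the density at every grid point and reshape back to the grid
zi = k(np.vstack([xi.ravel(), yi.ravel()])).reshape(xi.shape)

print(zi.shape)
```

The resulting `xi`, `yi`, and `zi` arrays have matching shapes, so they can be passed to `plt.pcolormesh(xi, yi, zi)` as in the notebook cell.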

In [None]:
# plot with density of xcorr vs length and xcorr vs total ptms
plt.figure()

# read in data
x = comet['xcorr']
y = comet['stripped length']

a = comet['xcorr']
b = comet['ptm-total']

# evaluate a gaussian kernel density estimation (KDE) on a regular grid of nbins x nbins over data extents
nbins=300
k = kde.gaussian_kde([x,y])
xi, yi = np.mgrid[x.min():x.max():nbins*1j, y.min():y.max():nbins*1j]
zi = k(np.vstack([xi.flatten(), yi.flatten()]))

c = kde.gaussian_kde([a,b])
ai, bi = np.mgrid[a.min():a.max():nbins*1j, b.min():b.max():nbins*1j]
di = c(np.vstack([ai.flatten(), bi.flatten()]))

# density plot of length vs xcorr
plt.subplot(221)
plt.pcolormesh(xi, yi, zi.reshape(xi.shape))
plt.title('Combined MED4 Comet XCorr vs peptide length')
plt.colorbar()

# density plot of xcorr vs ptms
plt.subplot(222)
plt.pcolormesh(ai, bi, di.reshape(ai.shape))
plt.title('Combined MED4 Comet XCorr vs total PTMs')
plt.colorbar()


plt.show()

### XCorr > 2.5 - Visualizing the results:

In [None]:
# making evenly spaced bins for the Xcorr data based on the min and max, called above
bins = [2.5, 3, 3.5, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5, 8, 8.5, 9]
labels = ['2.5-3', '3-3.5', '3.5-4', '4-4.5', '4.5-5', '5-5.5', '5.5-6', '6-6.5', '6.5-7', '7-7.5', '7.5-8', '8-8.5', '8.5-9']

# use pandas cut function to do the binning itself
comet25['binned'] = pd.cut(comet25['xcorr'], bins=bins, labels=labels)

# bar plots of binned PTM data

# map each plot column name to its PTM count column in the comet25 dataframe
ptm_cols = {'Total PTMs': 'ptm-total', 'Cys carb.': 'c-carb', 'Met oxi.': 'm-oxid',
            'Asp deam.': 'n-deam', 'Glut deam.': 'q-deam', 'Lys iron': 'k-iron',
            'Lys meth.': 'k-meth', 'Arg meth.': 'r-meth'}

# sum every PTM column within each XCorr bin in a single groupby;
# observed=False keeps bins that contain no spectra (their sum is 0)
comet25bin = comet25.groupby('binned', observed=False)[list(ptm_cols.values())].sum()

# rename the columns to the plot names and use plain bin labels as the index
comet25bin.columns = list(ptm_cols)
comet25bin.index = list(comet25bin.index)

ax1 = comet25bin.plot.bar(y='Total PTMs', rot=45)
ax1.set_title('Total PTMs')

ax2 = comet25bin.plot.bar(y='Cys carb.', rot=45)
ax2.set_title('Cysteine carbamidomethylation')

ax3 = comet25bin.plot.bar(y='Met oxi.', rot=45)
ax3.set_title('Methionine oxidation')

ax4 = comet25bin.plot.bar(y='Asp deam.', rot=45)
ax4.set_title('Asparagine deamidation')

ax5 = comet25bin.plot.bar(y='Glut deam.', rot=45)
ax5.set_title('Glutamine deamidation')

ax6 = comet25bin.plot.bar(y='Lys iron', rot=45)
ax6.set_title('Lysine iron adduct')

ax7 = comet25bin.plot.bar(y='Lys meth.', rot=45)
ax7.set_title('Lysine methylation')

ax8 = comet25bin.plot.bar(y='Arg meth.', rot=45)
ax8.set_title('Arginine methylation')

#plt.savefig('/home/millieginty/Documents/git-repos/2017-etnp/analyses/pronovo-2020/pronovo-2020-ptm/MED4_trypsin1_comet25_PTMopt.png')

In [None]:
# histogram of stripped peptide lengths

#plt.rcdefaults()
#fig, ax = plt.subplots()

comet25.plot(y='stripped length', kind = 'hist', bins = 20, title = 'Peptide length')
plt.show()

In [None]:
# histogram of the total number of PTMs per sequence

#plt.rcdefaults()
#fig, ax = plt.subplots()

comet25.plot(y='ptm-total', kind = 'hist', bins = 20, title = 'PTMs/sequence')
plt.show()

In [None]:
from matplotlib import pyplot
# density plot of xcorr vs length (idea from https://python-graph-gallery.com/85-density-plot-with-matplotlib/)

# read in data
x = comet25['xcorr']
y = comet25['stripped length']
 
# evaluate a gaussian kernel density estimation (KDE) on a regular grid of nbins x nbins over data extents
nbins=300
k = kde.gaussian_kde([x,y])
xi, yi = np.mgrid[x.min():x.max():nbins*1j, y.min():y.max():nbins*1j]
zi = k(np.vstack([xi.flatten(), yi.flatten()]))
 
# make the plot
plt.pcolormesh(xi, yi, zi.reshape(xi.shape))
plt.colorbar()
plt.show() 

In [None]:
# now we have the stripped peptide csvs and txt files in the same data dir:
!ls /home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/RAL4_95_MED2_trypsin/TPP/