### Manipulation of Trans Proteomic Pipeline (TPP) Comet database search results of *Prochlorococus MED4* LC-MS/MS data using python.

Starting with: 

- Comet output (.xlxs and .csv) of PTM-optimized database searches, sorted by XCorr (descending) and run through XInteract to extract precursor intensities and protein descriptions mapped from the search database.

Ending with:

- Files with stripped (no PTMs or tryptic ends) peptide lists and
- Columns with #'s of each modification in every sequence
- Column with stripped peptide lengths (# amino acids)
- Histogram of sequence lengths
- Bar plots of PTM occurance

### To use:

#### 1. Change the input file name in *IN 4*
#### 2. Change output file name in *IN 6*, *IN 7*, *IN 8*

For technical duplicates, I exported Comet search results as both Excel files and as CSVs into my ETNP 2017 git repo:

Also, when running through XInteract in the TPP, I combined the duplicate injections into a single PepXML file which I exported as an xls file and converted to a csv.

In [1]:
cd /home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/RAL4_95_MED2_trypsin/TPP/

/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/RAL4_95_MED2_trypsin/TPP


In [2]:
ls

RAL4_MED2_combine_Comet2.5Xcorr_proteins.txt
RAL4_MED2_combine_Comet3Xcorr_proteins.txt
RAL4_MED2_trypsin_1_PTMopt_Comet.csv
RAL4_MED2_trypsin_1_PTMopt_Comet_stripped.csv
RAL4_MED2_trypsin_1_PTMopt_Comet_stripped_peptides_2.5XCorr.txt
RAL4_MED2_trypsin_1_PTMopt_Comet_stripped_peptides.txt
RAL4_MED2_trypsin_1_PTMopt_Comet_stripped_work.ods
RAL4_MED2_trypsin_1_PTMopt_Comet.xlsx
RAL4_MED2_trypsin_1_PTMopt_PepProp90.csv
RAL4_MED2_trypsin_1_PTMopt_PepProp90_stripped.csv
RAL4_MED2_trypsin_1_PTMopt_PepProp90_stripped_peptides
RAL4_MED2_trypsin_1_PTMopt_PepProp90.xlsx
RAL4_MED2_trypsin_2_PTMopt_Comet.csv
RAL4_MED2_trypsin_2_PTMopt_Comet_stripped.csv
RAL4_MED2_trypsin_2_PTMopt_Comet_stripped_peptides_2.5XCorr.txt
RAL4_MED2_trypsin_2_PTMopt_Comet_stripped_peptides.txt
RAL4_MED2_trypsin_2_PTMopt_Comet_stripped_work.ods
RAL4_MED2_trypsin_2_PTMopt_Comet.xlsx
RAL4_MED2_trypsin_2_PTMopt_PepProp90.csv
RAL4_MED2_trypsin_2_PTMopt_PepProp90_stripped.csv
RAL4_MED2_trypsin_2_PTMopt_PepP

In [3]:
# LIBRARIES
#import pandas library for working with tabular data
import os
os.getcwd()
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
#from matplotlib import pyplot
import seaborn as sns
from scipy.stats import kde
#import regular expresson (regex)
import re
#check pandas version
pd.__version__

'1.0.5'

In [66]:
# formerly, read in the replicates without precursor intensities and protein descriptions:

# read the CSVs of each replicate into a datadrame we name 'comet' using the pandas read_csv function
##comet1 = pd.read_csv("/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/RAL4_95_MED2_trypsin/TPP/RAL4_MED2_trypsin_1_PTMopt_Comet.csv")
##comet2 = pd.read_csv("/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/RAL4_95_MED2_trypsin/TPP/RAL4_MED2_trypsin_2_PTMopt_Comet.csv")

##frames = [comet1, comet2]

# concatenate dataframes
## cometdup = pd.concat(frames, sort=False)

# now, reading in the combined csv that contains precursor intensities and protein descriptions
cometdup = pd.read_csv("/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/RAL4_95_MED2_trypsin/TPP/ral_95_med2_trypsin_combine_quant_concat_pepxml.csv", index_col='spectrum')

# some of the protein description text got splinched into the precursor intensity column. 
# to separate the text from the numbers, use regex exressions to retains either all but numbers or all but text
# into new columns, 'Subunit', containing all misc protein desc., and the actual 'Precursor Intensity'
cometdup['Subunit'] = cometdup['preint'].str.replace('\d+', '')
cometdup['Precursor Intensity'] = cometdup['preint'].str.replace('\D', '')

# remove redundant rows
comet = pd.DataFrame.drop_duplicates(cometdup)

print("# redundant Comet peptides in combined dataframe", len(cometdup))
print("# nonredundant Comet peptides in combined dataframe", len(comet))

comet.head()

# redundant Comet peptides in combined dataframe 115289
# nonredundant Comet peptides in combined dataframe 115289


Unnamed: 0_level_0,xcorr,deltacn,expect,ions,peptide,protein,calc_neutral_pep_mass,protein_descr,preint,Subunit,Precursor Intensity
spectrum,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
022016_RAL4_95_MED2_trypsin_2.32518.32518.4,8.78,1.0,5.07e-12,12/204,K.LAIDDSSINLDQVDYIN[115.03]AHGTSTTANDKNETSAIK.S,PMM1609,3734.7759,| fabF | 3-oxoacyl-[acyl-carrier-protein] synt...,5246400,,5246400
022016_RAL4_95_MED2_trypsin_1.41211.41211.4,8.768,1.0,1.42e-09,22/156,K.LFADENHLSPAVTAIQIEDIDAEQFRK.N,PMM0035,3069.5407,| DHSS | soluble hydrogenase small subunit,2967010,,2967010
022016_RAL4_95_MED2_trypsin_2.50643.50643.4,8.599,0.583,2.14e-12,22/156,R.SGLQNAASIAGM[147.04]VLTTEC[160.03]IVADLPEKK.D,PMM1436,2831.4409,| groEL | chaperonin GroEL,7136140,,7136140
022016_RAL4_95_MED2_trypsin_1.32793.32793.4,8.578,1.0,9.31e-10,18/204,K.LAIDDSSIN[115.03]LDQVDYIN[115.03]AHGTSTTANDK...,PMM1609,3735.7599,| fabF | 3-oxoacyl-[acyl-carrier-protein] synt...,6552550,,6552550
022016_RAL4_95_MED2_trypsin_1.50751.50751.4,8.469,0.646,3.43e-14,24/156,R.SGLQNAASIAGM[147.04]VLTTEC[160.03]IVADLPEKK.D,PMM1436,2831.4409,| groEL | chaperonin GroEL,8293220,,8293220


The peptide column has the residues before and after the tryptic terminii as well as masses of modified residues (e.g., 160.03 Da for carbamidomethylated cysteine). We want to make new columns with all that information and make a column with only the 'stripped' peptide sequence that's just amino acids - this we can then align against other sequences, for example. 

Modified residues were allowed for:

- fixed carbamodimethylation of cysteine 57.021464 C
- varialbe oxidation of methionine: 15.9949 M
- variable deamidation of asparagine, glumatine: 0.984016 NQ
- variable iron cation on lysine: 54.010565 K
- variable methylation of lysine and arginine: 14.015650 KR

We'll then write this manipulated dataframe to a new file.

In [67]:
# get rid of rows where the xcorr is unavailable (usually 3 or so)
comet = comet[comet.xcorr != '[unavailable]']

# use str.strip with indexing by str[0] to add a column with the peptide's left terminus
comet['L terminus'] = comet['peptide'].astype(str).str[0]

# use str.strip with indexing by str[-1] to add a column with the peptide's left terminus
comet['R terminus'] = comet['peptide'].str.strip().str[-1]

# use a count function to enumerate the # of A's (alanines) in each peptide
comet['A'] = comet['peptide'].str.count("A")

# use a count function to enumerate the # of C's (cysteines) in each peptide
comet['C'] = comet['peptide'].str.count("C")

# use a count function to enumerate the # of D's (aspartic acids) in each peptide
comet['D'] = comet['peptide'].str.count("D")

# use a count function to enumerate the # of E's (glutamic acids) in each peptide
comet['E'] = comet['peptide'].str.count("E")

# use a count function to enumerate the # of F's (phenylalanines) in each peptide
comet['F'] = comet['peptide'].str.count("F")

# use a count function to enumerate the # of G's (glycines) in each peptide
comet['G'] = comet['peptide'].str.count("G")

# use a count function to enumerate the # of H's (histidines) in each peptide
comet['H'] = comet['peptide'].str.count("H")

# use a count function to enumerate the # of I's (isoleucines) in each peptide
# in comet output, there will be no isoleucines (they're lumped in with leucines)
comet['I'] = comet['peptide'].str.count("I")

# use a count function to enumerate the # of K's (lysines) in each peptide
comet['K'] = comet['peptide'].str.count("K")

# use a count function to enumerate the # of L's (leucines) in each peptide
comet['L'] = comet['peptide'].str.count("L")

# use a count function to enumerate the # of M's (methionines) in each peptide
comet['M'] = comet['peptide'].str.count("M")

# use a count function to enumerate the # of N's (asparagines) in each peptide
comet['N'] = comet['peptide'].str.count("N")

# use a count function to enumerate the # of P's ([prolines]) in each peptide
comet['P'] = comet['peptide'].str.count("P")

# use a count function to enumerate the # of Q's (glutamines) in each peptide
comet['Q'] = comet['peptide'].str.count("Q")

# use a count function to enumerate the # of R's (arginines) in each peptide
comet['R'] = comet['peptide'].str.count("R")

# use a count function to enumerate the # of S's (serines) in each peptide
comet['S'] = comet['peptide'].str.count("S")

# use a count function to enumerate the # of T's (threonines) in each peptide
comet['T'] = comet['peptide'].str.count("T")

# use a count function to enumerate the # of V's (valines) in each peptide
comet['V'] = comet['peptide'].str.count("V")

# use a count function to enumerate the # of W's (tryptophans) in each peptide
comet['W'] = comet['peptide'].str.count("W")

# use a count function to enumerate the # of Y's (tyrosines) in each peptide
comet['Y'] = comet['peptide'].str.count("Y")

# use a count function to enumerate the # of ccarbamidomethylated C's in each peptide
comet['c-carb'] = comet['peptide'].str.count("160.03")

# use a count function to enumerate the # of oxidized M's in each peptide
comet['m-oxid'] = comet['peptide'].str.count("147.04")

# use a count function to enumerate the # of deamidated N's in each peptide
comet['n-deam'] = comet['peptide'].str.count("115.03")

# use a count function to enumerate the # of deamidated Q's in each peptide
comet['q-deam'] = comet['peptide'].str.count("129.04")

# use a count function to enumerate the # of iron adducted K's in each peptide
comet['k-iron'] = comet['peptide'].str.count("182.11")

# use a count function to enumerate the # of methylated K's in each peptide
comet['k-meth'] = comet['peptide'].str.count("142.11")

# use a count function to enumerate the # of methylated R's in each peptide
comet['r-meth'] = comet['peptide'].str.count("170.12")

# create a column with 'stripped' peptide sequences using strip
comet['stripped peptide'] = comet['peptide'].str[2:].str[:-2].str.replace(r"\[.*\]","")

# add a column with the stripped peptide length (number of AAs)
comet['stripped length'] = comet['stripped peptide'].apply(len)

# total the number of modifications in sequence
comet['ptm-total'] = comet['c-carb'] + comet['m-oxid'] + comet['n-deam'] + comet['q-deam'] + comet['k-iron'] + comet['k-meth'] + comet['r-meth']

# write modified dataframe to new txt file, same name + 'stripped'
comet.to_csv("/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/RAL4_95_MED2_trypsin/TPP/RAL4_MED2_trypsin_1_PTMopt_Comet_stripped.csv")


# check out the results
comet.head()

  res_values = method(rvalues)


Unnamed: 0_level_0,xcorr,deltacn,expect,ions,peptide,protein,calc_neutral_pep_mass,protein_descr,preint,Subunit,...,c-carb,m-oxid,n-deam,q-deam,k-iron,k-meth,r-meth,stripped peptide,stripped length,ptm-total
spectrum,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
022016_RAL4_95_MED2_trypsin_2.32518.32518.4,8.78,1.0,5.07e-12,12/204,K.LAIDDSSINLDQVDYIN[115.03]AHGTSTTANDKNETSAIK.S,PMM1609,3734.7759,| fabF | 3-oxoacyl-[acyl-carrier-protein] synt...,5246400,,...,0,0,1,0,0,0,0,LAIDDSSINLDQVDYINAHGTSTTANDKNETSAIK,35,1
022016_RAL4_95_MED2_trypsin_1.41211.41211.4,8.768,1.0,1.42e-09,22/156,K.LFADENHLSPAVTAIQIEDIDAEQFRK.N,PMM0035,3069.5407,| DHSS | soluble hydrogenase small subunit,2967010,,...,0,0,0,0,0,0,0,LFADENHLSPAVTAIQIEDIDAEQFRK,27,0
022016_RAL4_95_MED2_trypsin_2.50643.50643.4,8.599,0.583,2.14e-12,22/156,R.SGLQNAASIAGM[147.04]VLTTEC[160.03]IVADLPEKK.D,PMM1436,2831.4409,| groEL | chaperonin GroEL,7136140,,...,1,1,0,0,0,0,0,SGLQNAASIAGMIVADLPEKK,21,2
022016_RAL4_95_MED2_trypsin_1.32793.32793.4,8.578,1.0,9.31e-10,18/204,K.LAIDDSSIN[115.03]LDQVDYIN[115.03]AHGTSTTANDK...,PMM1609,3735.7599,| fabF | 3-oxoacyl-[acyl-carrier-protein] synt...,6552550,,...,0,0,2,0,0,0,0,LAIDDSSINAHGTSTTANDKNETSAIK,27,2
022016_RAL4_95_MED2_trypsin_1.50751.50751.4,8.469,0.646,3.43e-14,24/156,R.SGLQNAASIAGM[147.04]VLTTEC[160.03]IVADLPEKK.D,PMM1436,2831.4409,| groEL | chaperonin GroEL,8293220,,...,1,1,0,0,0,0,0,SGLQNAASIAGMIVADLPEKK,21,2


## Calculating the false discovery rate (% FDR)

### Filtering PSMs > a selected XCorr value and exporting peptides

In [68]:
# Let's separate out the decoy hits from the good ones

cometpmm = comet[~comet['protein'].str.contains("DECOY")]
cometdec = comet[comet['protein'].str.contains("DECOY")]

# how many PSM that are only PMM (proteins in the database)?

print("# real Comet PSMs", len(cometpmm))

# compared to how many PSMs containing decoys?

print("# decoy Comet PSMs", len(cometdec))

# calculate the bulk FDR (all PSMs so let's not beat ourselves up)

r = len(cometpmm)
d = len(cometdec)

FDR = d/r*100

print("False discovery rate = ", FDR)

# real Comet PSMs 80669
# decoy Comet PSMs 34620
False discovery rate =  42.91611399670258


In [69]:
# keep only peptides  >2.5 XCorr
# need to convert Xcorr column from strings to numeric so we can use loc
comet['xcorr'] = pd.to_numeric(comet['xcorr'])

comet25 = comet.loc[comet['xcorr'] >= 2.5]

# What's the FDR?

# Let's separate out the decoy hits from the good ones

cometpmm25 = comet25[~comet25['protein'].str.contains("DECOY")]
cometdec25 = comet25[comet25['protein'].str.contains("DECOY")]

# how many PSM that are only PMM (proteins in the database)?

print("# real Comet PSMs", len(cometpmm25))

# compared to how many PSMs containing decoys?

print("# decoy Comet PSMs", len(cometdec25))

# calculate the FDR 

r = len(cometpmm25)
d = len(cometdec25)

FDR = d/(d+r)*100

print("False discovery rate = ", FDR)

# real Comet PSMs 35931
# decoy Comet PSMs 1270
False discovery rate =  3.4138867234751755


In [70]:
# keep only peptides  >3 XCorr
# need to convert Xcorr column from strings to numeric so we can use loc
comet['xcorr'] = pd.to_numeric(comet['xcorr'])

comet3 = comet.loc[comet['xcorr'] >= 3]

# What's the FDR?

# Let's separate out the decoy hits from the good ones

cometpmm3 = comet3[~comet3['protein'].str.contains("DECOY")]
cometdec3 = comet3[comet3['protein'].str.contains("DECOY")]

# how many PSM that are only PMM (proteins in the database)?

print("# real Comet PSMs", len(cometpmm3))

# compared to how many PSMs containing decoys?

print("# decoy Comet PSMs", len(cometdec3))

# calculate the FDR 

r = len(cometpmm3)
d = len(cometdec3)

FDR = d/(d+r)*100

print("False discovery rate = ", FDR)

# real Comet PSMs 26923
# decoy Comet PSMs 258
False discovery rate =  0.9491924506088812


### Exporting peptides from a XCorr > 2.5 and XCorr > 3 thresholds:

In [71]:
# keep only peptides  >2.5 XCorr
# need to convert Xcorr column from strings to numeric so we can use loc
comet['xcorr'] = pd.to_numeric(comet['xcorr'])

comet25 = comet.loc[comet['xcorr'] >= 2.5]

# Let's separate out the decoy hits from the good ones

cometpmm25 = comet25[~comet25['protein'].str.contains("DECOY")]
cometdec25 = comet25[comet25['protein'].str.contains("DECOY")]


# keep only peptide column 
pep25 = cometpmm25[["stripped peptide"]]

# write altered dataframe to new txt file
# used header and index parameters to get rid of 'Peptide' header and the indexing

pep25.to_csv("/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/RAL4_95_MED2_trypsin/TPP/RAL4_MED2_trypsin_combine_PTMopt_Comet_stripped_peptides_2.5XCorr.txt", header=False, index=False)

# removing redundancy
pep25dedup = pd.DataFrame.drop_duplicates(pep25)

print("# redundant Comet peptides >2.5 XCorr", len(pep25))
print("# nonredundant Comet peptides >2.5 XCOrr", len(pep25dedup))

pep25.head()

# redundant Comet peptides >2.5 XCorr 35931
# nonredundant Comet peptides >2.5 XCOrr 12283


Unnamed: 0_level_0,stripped peptide
spectrum,Unnamed: 1_level_1
022016_RAL4_95_MED2_trypsin_2.32518.32518.4,LAIDDSSINLDQVDYINAHGTSTTANDKNETSAIK
022016_RAL4_95_MED2_trypsin_1.41211.41211.4,LFADENHLSPAVTAIQIEDIDAEQFRK
022016_RAL4_95_MED2_trypsin_2.50643.50643.4,SGLQNAASIAGMIVADLPEKK
022016_RAL4_95_MED2_trypsin_1.32793.32793.4,LAIDDSSINAHGTSTTANDKNETSAIK
022016_RAL4_95_MED2_trypsin_1.50751.50751.4,SGLQNAASIAGMIVADLPEKK


In [72]:
# keep only peptides  >3 XCorr
# need to convert Xcorr column from strings to numeric so we can use loc
comet['xcorr'] = pd.to_numeric(comet['xcorr'])

comet3 = comet.loc[comet['xcorr'] >= 3]

# Let's separate out the decoy hits from the good ones

cometpmm3 = comet3[~comet3['protein'].str.contains("DECOY")]
cometdec3 = comet3[comet3['protein'].str.contains("DECOY")]

# export the whole table for Comet XCorr > 3
cometpmm3.to_csv("/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/RAL4_95_MED2_trypsin/TPP/RAL4_MED2_trypsin_combine_PTMopt_Comet_3XCorr_noDECOY.csv")

# keep only peptide column 
pep3 = cometpmm3[["stripped peptide"]]

# write altered dataframe to new txt file
# used header and index parameters to get rid of 'Peptide' header and the indexing

pep3.to_csv("/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/RAL4_95_MED2_trypsin/TPP/RAL4_MED2_trypsin_combine_PTMopt_Comet_stripped_peptides_3XCorr.txt", header=False, index=False)

# removing redundancy
pep3dedup = pd.DataFrame.drop_duplicates(pep3)

print("# redundant Comet peptides >3 XCorr", len(pep3))
print("# nonredundant Comet peptides >3 XCOrr", len(pep3dedup))

pep3.head()

# redundant Comet peptides >3 XCorr 26923
# nonredundant Comet peptides >3 XCOrr 9213


Unnamed: 0_level_0,stripped peptide
spectrum,Unnamed: 1_level_1
022016_RAL4_95_MED2_trypsin_2.32518.32518.4,LAIDDSSINLDQVDYINAHGTSTTANDKNETSAIK
022016_RAL4_95_MED2_trypsin_1.41211.41211.4,LFADENHLSPAVTAIQIEDIDAEQFRK
022016_RAL4_95_MED2_trypsin_2.50643.50643.4,SGLQNAASIAGMIVADLPEKK
022016_RAL4_95_MED2_trypsin_1.32793.32793.4,LAIDDSSINAHGTSTTANDKNETSAIK
022016_RAL4_95_MED2_trypsin_1.50751.50751.4,SGLQNAASIAGMIVADLPEKK


## NAAF correction and exporting files with AA and PTM totals:

In [73]:
# for each XCorr threshold, add a column to the decoy-removed df that calculated the precursor intensity 
# over the peptide legth (stripped)
# this is the numerator in the NAAF correction 

cometpmm25['Precursor Intensity'] = pd.to_numeric(cometpmm25['Precursor Intensity'])

# calculate NAAF numerator for each peptide k in Comet > 2.5, no decoys
cometpmm25['NAAF num.'] = cometpmm25['Precursor Intensity'] / cometpmm25['stripped length']

ValueError: Integer out of range. at position 19296

In [None]:
# made a new dataframe that contains the sums of certain columns in the stripped peptide dataframe above
# choosing the XCorr > 2.5 filtered results

index = ['sample total']

data = {'A': comet25['A'].sum(),
        'C': comet25['C'].sum(),
        'D': comet25['D'].sum(),
        'E': comet25['E'].sum(),
        'F': comet25['F'].sum(),
        'G': comet25['G'].sum(),
        'H': comet25['H'].sum(),
        'I': comet25['I'].sum(),
        'K': comet25['K'].sum(),
        'L': comet25['L'].sum(),
        'M': comet25['M'].sum(),
        'N': comet25['N'].sum(),
        'P': comet25['P'].sum(),
        'Q': comet25['Q'].sum(),
        'R': comet25['R'].sum(),
        'S': comet25['S'].sum(),
        'T': comet25['T'].sum(),
        'V': comet25['V'].sum(),
        'W': comet25['W'].sum(),
        'Y': comet25['Y'].sum(),
        'c-carb': comet25['c-carb'].sum(),
        'm-oxid': comet25['m-oxid'].sum(),
        'n-deam': comet25['n-deam'].sum(),
        'q-deam': comet25['q-deam'].sum(),
        'k-iron': comet25['k-iron'].sum(),
        'k-meth': comet25['k-meth'].sum(),
        'r-meth': comet25['r-meth'].sum()
       }

totalcomet25 = pd.DataFrame(data, columns=['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y', 'c-carb', 'm-oxid', 'n-deam', 'q-deam', 'k-iron', 'k-meth', 'r-meth' ], index=index)

# calculate percentage of C's with carb (should be 1.0)
totalcomet25['% C w/ carb.'] = totalcomet25['c-carb'] / totalcomet25['C'] 

# calculate percentage of M's that are oxidized
totalcomet25['% M w/ oxid'] = totalcomet25['m-oxid'] / totalcomet25['M'] 

# calculate percentage of N's that are deamidated
totalcomet25['% N w/ deam'] = totalcomet25['n-deam'] / totalcomet25['N'] 

# calculate percentage of Q's that are deamidated
totalcomet25['% Q w/ deam'] = totalcomet25['q-deam'] / totalcomet25['Q'] 

# calculate percentage of K's that are hydroxylated
totalcomet25['% K w/ iron'] = totalcomet25['k-iron'] / totalcomet25['K'] 

# calculate percentage of K's that are methylated
totalcomet25['% K w/ meth'] = totalcomet25['k-meth'] / totalcomet25['K'] 

# calculate percentage of R's that are methylated
totalcomet25['% R w/ meth'] = totalcomet25['r-meth'] / totalcomet25['R'] 

# write modified dataframe to new txt file, same name + 'stripped'
totalcomet25.to_csv("/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/RAL4_95_MED2_trypsin/TPP/RAL95_MED2_trypsin_combine_Comet25_totals.csv")

totalcomet25.head()

In [None]:
# made a new dataframe that contains the sums of certain columns in the stripped peptide dataframe above
# choosing the XCorr > 3 filtered results

index = ['sample total']

data = {'A': comet3['A'].sum(),
        'C': comet3['C'].sum(),
        'D': comet3['D'].sum(),
        'E': comet3['E'].sum(),
        'F': comet3['F'].sum(),
        'G': comet3['G'].sum(),
        'H': comet3['H'].sum(),
        'I': comet3['I'].sum(),
        'K': comet3['K'].sum(),
        'L': comet3['L'].sum(),
        'M': comet3['M'].sum(),
        'N': comet3['N'].sum(),
        'P': comet3['P'].sum(),
        'Q': comet3['Q'].sum(),
        'R': comet3['R'].sum(),
        'S': comet3['S'].sum(),
        'T': comet3['T'].sum(),
        'V': comet3['V'].sum(),
        'W': comet3['W'].sum(),
        'Y': comet3['Y'].sum(),
        'c-carb': comet3['c-carb'].sum(),
        'm-oxid': comet3['m-oxid'].sum(),
        'n-deam': comet3['n-deam'].sum(),
        'q-deam': comet3['q-deam'].sum(),
        'k-iron': comet3['k-iron'].sum(),
        'k-meth': comet3['k-meth'].sum(),
        'r-meth': comet3['r-meth'].sum()
       }

totalcomet3 = pd.DataFrame(data, columns=['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y', 'c-carb', 'm-oxid', 'n-deam', 'q-deam', 'k-iron', 'k-meth', 'r-meth' ], index=index)

# calculate percentage of C's with carb (should be 1.0)
totalcomet3['% C w/ carb.'] = totalcomet3['c-carb'] / totalcomet3['C'] 

# calculate percentage of M's that are oxidized
totalcomet3['% M w/ oxid'] = totalcomet3['m-oxid'] / totalcomet3['M'] 

# calculate percentage of N's that are deamidated
totalcomet3['% N w/ deam'] = totalcomet3['n-deam'] / totalcomet3['N'] 

# calculate percentage of Q's that are deamidated
totalcomet3['% Q w/ deam'] = totalcomet3['q-deam'] / totalcomet3['Q'] 

# calculate percentage of K's that are hydroxylated
totalcomet3['% K w/ iron'] = totalcomet3['k-iron'] / totalcomet3['K'] 

# calculate percentage of K's that are methylated
totalcomet3['% K w/ meth'] = totalcomet3['k-meth'] / totalcomet3['K'] 

# calculate percentage of R's that are methylated
totalcomet3['% R w/ meth'] = totalcomet3['r-meth'] / totalcomet3['R'] 

# write modified dataframe to new txt file, same name + 'stripped'
totalcomet3.to_csv("/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/RAL4_95_MED2_trypsin/TPP/RAL95_MED2_trypsin_combine_Comet3_totals.csv")

totalcomet3.head()

### All XCorr - Visualizing the results:

In [None]:
# making evenly spaced bins for the Xcorr data based on the min and max, called above
bins = [0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5, 8, 8.5, 9]
labels = ['0-0.5', '0.5-1', '1-1.5', '1.5-2', '2-2.5', '2.5-3', '3-3.5', '3.5-4', '4-4.5', '4.5-5', '5-5.5', '5.5-6', '6-6.5', '6.5-7', '7-7.5', '7.5-8', '8-8.5', '8.5-9']

# use pandas cut function to do the binning itself
comet['binned'] = pd.cut(comet['xcorr'], bins=bins, labels=labels)

# bar plots of binned PTM data

index = ['0-0.5', '0.5-1', '1-1.5', '1.5-2', '2-2.5', '2.5-3', '3-3.5', '3.5-4', '4-4.5', '4.5-5', '5-5.5', '5.5-6', '6-6.5', '6.5-7', '7-7.5', '7.5-8', '8-8.5', '8.5-9']
data = {'Total PTMs': [comet.groupby('binned')['ptm-total'].sum()['0-0.5'], comet.groupby('binned')['ptm-total'].sum()['0.5-1'], comet.groupby('binned')['ptm-total'].sum()['1-1.5'], comet.groupby('binned')['ptm-total'].sum()['1.5-2'], comet.groupby('binned')['ptm-total'].sum()['2-2.5'], comet.groupby('binned')['ptm-total'].sum()['2.5-3'], comet.groupby('binned')['ptm-total'].sum()['3-3.5'], comet.groupby('binned')['ptm-total'].sum()['3.5-4'], comet.groupby('binned')['ptm-total'].sum()['4-4.5'], comet.groupby('binned')['ptm-total'].sum()['4.5-5'], comet.groupby('binned')['ptm-total'].sum()['5-5.5'], comet.groupby('binned')['ptm-total'].sum()['5.5-6'], comet.groupby('binned')['ptm-total'].sum()['6-6.5'], comet.groupby('binned')['ptm-total'].sum()['6.5-7'], comet.groupby('binned')['ptm-total'].sum()['7-7.5'], comet.groupby('binned')['ptm-total'].sum()['7.5-8'], comet.groupby('binned')['ptm-total'].sum()['8-8.5'], comet.groupby('binned')['ptm-total'].sum()['8.5-9']],
        'Cys carb.': [comet.groupby('binned')['c-carb'].sum()['0-0.5'], comet.groupby('binned')['c-carb'].sum()['0.5-1'], comet.groupby('binned')['c-carb'].sum()['1-1.5'], comet.groupby('binned')['c-carb'].sum()['1.5-2'], comet.groupby('binned')['c-carb'].sum()['2-2.5'], comet.groupby('binned')['c-carb'].sum()['2.5-3'], comet.groupby('binned')['c-carb'].sum()['3-3.5'], comet.groupby('binned')['c-carb'].sum()['3.5-4'], comet.groupby('binned')['c-carb'].sum()['4-4.5'], comet.groupby('binned')['c-carb'].sum()['4.5-5'], comet.groupby('binned')['c-carb'].sum()['5-5.5'], comet.groupby('binned')['c-carb'].sum()['5.5-6'], comet.groupby('binned')['c-carb'].sum()['6-6.5'], comet.groupby('binned')['c-carb'].sum()['6.5-7'], comet.groupby('binned')['c-carb'].sum()['7-7.5'], comet.groupby('binned')['c-carb'].sum()['7.5-8'], comet.groupby('binned')['c-carb'].sum()['8-8.5'], comet.groupby('binned')['c-carb'].sum()['8.5-9']],
        'Met oxi.': [comet.groupby('binned')['m-oxid'].sum()['0-0.5'], comet.groupby('binned')['m-oxid'].sum()['0.5-1'], comet.groupby('binned')['m-oxid'].sum()['1-1.5'], comet.groupby('binned')['m-oxid'].sum()['1.5-2'], comet.groupby('binned')['m-oxid'].sum()['2-2.5'], comet.groupby('binned')['m-oxid'].sum()['2.5-3'], comet.groupby('binned')['m-oxid'].sum()['3-3.5'], comet.groupby('binned')['m-oxid'].sum()['3.5-4'], comet.groupby('binned')['m-oxid'].sum()['4-4.5'], comet.groupby('binned')['m-oxid'].sum()['4.5-5'], comet.groupby('binned')['m-oxid'].sum()['5-5.5'], comet.groupby('binned')['m-oxid'].sum()['5.5-6'], comet.groupby('binned')['m-oxid'].sum()['6-6.5'], comet.groupby('binned')['m-oxid'].sum()['6.5-7'], comet.groupby('binned')['m-oxid'].sum()['7-7.5'], comet.groupby('binned')['m-oxid'].sum()['7.5-8'], comet.groupby('binned')['m-oxid'].sum()['8-8.5'], comet.groupby('binned')['m-oxid'].sum()['8.5-9']],
        'Asp deam.': [comet.groupby('binned')['n-deam'].sum()['0-0.5'], comet.groupby('binned')['n-deam'].sum()['0.5-1'], comet.groupby('binned')['n-deam'].sum()['1-1.5'], comet.groupby('binned')['n-deam'].sum()['1.5-2'], comet.groupby('binned')['n-deam'].sum()['2-2.5'], comet.groupby('binned')['n-deam'].sum()['2.5-3'], comet.groupby('binned')['n-deam'].sum()['3-3.5'], comet.groupby('binned')['n-deam'].sum()['3.5-4'], comet.groupby('binned')['n-deam'].sum()['4-4.5'], comet.groupby('binned')['n-deam'].sum()['4.5-5'], comet.groupby('binned')['n-deam'].sum()['5-5.5'], comet.groupby('binned')['n-deam'].sum()['5.5-6'], comet.groupby('binned')['n-deam'].sum()['6-6.5'], comet.groupby('binned')['n-deam'].sum()['6.5-7'], comet.groupby('binned')['n-deam'].sum()['7-7.5'], comet.groupby('binned')['n-deam'].sum()['7.5-8'], comet.groupby('binned')['n-deam'].sum()['8-8.5'], comet.groupby('binned')['n-deam'].sum()['8.5-9']],
        'Glut deam.': [comet.groupby('binned')['q-deam'].sum()['0-0.5'], comet.groupby('binned')['q-deam'].sum()['0.5-1'], comet.groupby('binned')['q-deam'].sum()['1-1.5'], comet.groupby('binned')['q-deam'].sum()['1.5-2'], comet.groupby('binned')['q-deam'].sum()['2-2.5'], comet.groupby('binned')['q-deam'].sum()['2.5-3'], comet.groupby('binned')['q-deam'].sum()['3-3.5'], comet.groupby('binned')['q-deam'].sum()['3.5-4'], comet.groupby('binned')['q-deam'].sum()['4-4.5'], comet.groupby('binned')['q-deam'].sum()['4.5-5'], comet.groupby('binned')['q-deam'].sum()['5-5.5'], comet.groupby('binned')['q-deam'].sum()['5.5-6'], comet.groupby('binned')['q-deam'].sum()['6-6.5'], comet.groupby('binned')['q-deam'].sum()['6.5-7'], comet.groupby('binned')['q-deam'].sum()['7-7.5'], comet.groupby('binned')['q-deam'].sum()['7.5-8'], comet.groupby('binned')['q-deam'].sum()['8-8.5'], comet.groupby('binned')['q-deam'].sum()['8.5-9']],
        'Lys iron': [comet.groupby('binned')['k-iron'].sum()['0-0.5'], comet.groupby('binned')['k-iron'].sum()['0.5-1'], comet.groupby('binned')['k-iron'].sum()['1-1.5'], comet.groupby('binned')['k-iron'].sum()['1.5-2'], comet.groupby('binned')['k-iron'].sum()['2-2.5'], comet.groupby('binned')['k-iron'].sum()['2.5-3'], comet.groupby('binned')['k-iron'].sum()['3-3.5'], comet.groupby('binned')['k-iron'].sum()['3.5-4'], comet.groupby('binned')['k-iron'].sum()['4-4.5'], comet.groupby('binned')['k-iron'].sum()['4.5-5'], comet.groupby('binned')['k-iron'].sum()['5-5.5'], comet.groupby('binned')['k-iron'].sum()['5.5-6'], comet.groupby('binned')['k-iron'].sum()['6-6.5'], comet.groupby('binned')['k-iron'].sum()['6.5-7'], comet.groupby('binned')['k-iron'].sum()['7-7.5'], comet.groupby('binned')['k-iron'].sum()['7.5-8'], comet.groupby('binned')['k-iron'].sum()['8-8.5'], comet.groupby('binned')['k-iron'].sum()['8.5-9']],
        'Lys meth.': [comet.groupby('binned')['k-meth'].sum()['0-0.5'], comet.groupby('binned')['k-meth'].sum()['0.5-1'], comet.groupby('binned')['k-meth'].sum()['1-1.5'], comet.groupby('binned')['k-meth'].sum()['1.5-2'], comet.groupby('binned')['k-meth'].sum()['2-2.5'], comet.groupby('binned')['k-meth'].sum()['2.5-3'], comet.groupby('binned')['k-meth'].sum()['3-3.5'], comet.groupby('binned')['k-meth'].sum()['3.5-4'], comet.groupby('binned')['k-meth'].sum()['4-4.5'], comet.groupby('binned')['k-meth'].sum()['4.5-5'], comet.groupby('binned')['k-meth'].sum()['5-5.5'], comet.groupby('binned')['k-meth'].sum()['5.5-6'], comet.groupby('binned')['k-meth'].sum()['6-6.5'], comet.groupby('binned')['k-meth'].sum()['6.5-7'], comet.groupby('binned')['k-meth'].sum()['7-7.5'], comet.groupby('binned')['k-meth'].sum()['7.5-8'], comet.groupby('binned')['k-meth'].sum()['8-8.5'], comet.groupby('binned')['k-meth'].sum()['8.5-9']],
        'Arg meth.': [comet.groupby('binned')['r-meth'].sum()['0-0.5'], comet.groupby('binned')['r-meth'].sum()['0.5-1'], comet.groupby('binned')['r-meth'].sum()['1-1.5'], comet.groupby('binned')['r-meth'].sum()['1.5-2'], comet.groupby('binned')['r-meth'].sum()['2-2.5'], comet.groupby('binned')['r-meth'].sum()['2.5-3'], comet.groupby('binned')['r-meth'].sum()['3-3.5'], comet.groupby('binned')['r-meth'].sum()['3.5-4'], comet.groupby('binned')['r-meth'].sum()['4-4.5'], comet.groupby('binned')['r-meth'].sum()['4.5-5'], comet.groupby('binned')['r-meth'].sum()['5-5.5'], comet.groupby('binned')['r-meth'].sum()['5.5-6'], comet.groupby('binned')['r-meth'].sum()['6-6.5'], comet.groupby('binned')['r-meth'].sum()['6.5-7'], comet.groupby('binned')['r-meth'].sum()['7-7.5'], comet.groupby('binned')['r-meth'].sum()['7.5-8'], comet.groupby('binned')['r-meth'].sum()['8-8.5'], comet.groupby('binned')['r-meth'].sum()['8.5-9']]
        }

cometbin = pd.DataFrame(data, columns=['Total PTMs','Cys carb.','Met oxi.','Asp deam.', 'Glut deam.', 'Lys iron', 'Lys meth.', 'Arg meth.'], index=index)

ax1 = cometbin.plot.bar(y='Total PTMs', rot=45)
ax1.set_title('Total PTMs')

ax2 = cometbin.plot.bar(y='Cys carb.', rot=45)
ax2.set_title('Cysteine carbamidomethylation')

ax3 = cometbin.plot.bar(y='Met oxi.', rot=45)
ax3.set_title('Methionine oxidation')

ax4 = cometbin.plot.bar(y='Asp deam.', rot=45)
ax4.set_title('Asparagine deamidation')

ax5 = cometbin.plot.bar(y='Glut deam.', rot=45)
ax5.set_title('Glutamine deamidation')

ax6 = cometbin.plot.bar(y='Lys iron', rot=45)
ax6.set_title('Lysine iron adduct')

ax7 = cometbin.plot.bar(y='Lys meth.', rot=45)
ax7.set_title('Lysine methylation')

ax7 = cometbin.plot.bar(y='Arg meth.', rot=45)
ax7.set_title('Arginine methylation')

#plt.savefig('/home/millieginty/Documents/git-repos/2017-etnp/analyses/pronovo-2020/pronovo-2020-ptm/MED4_trypsin1_Comet_PTMopt.png')

In [None]:
# histogram of stripped peptide lengths

#plt.rcdefaults()
#fig, ax = plt.subplots()

comet.plot(y='stripped length', kind = 'hist', bins = 20, title = 'Peptide length')
plt

In [None]:
# histogram of total peptide amounts

#plt.rcdefaults()
#fig, ax = plt.subplots()

comet.plot(y='ptm-total', kind = 'hist', bins = 20, title = 'PTMs/sequence')
plt

In [None]:
from matplotlib import pyplot
# density plot of xcorr vs length (idea from https://python-graph-gallery.com/85-density-plot-with-matplotlib/)

# read in data
x = comet['xcorr']
y = comet['stripped length']
 
# evaluate a gaussian kernel density estimation (KDE) on a regular grid of nbins x nbins over data extents
nbins=300
k = kde.gaussian_kde([x,y])
xi, yi = np.mgrid[x.min():x.max():nbins*1j, y.min():y.max():nbins*1j]
zi = k(np.vstack([xi.flatten(), yi.flatten()]))
 
# make the plot
plt.pcolormesh(xi, yi, zi.reshape(xi.shape))
plt.colorbar()
plt.show() 

In [None]:
# plot with density of xcorr vs length and xcorr vs total ptms
plt.figure()

# read in data
x = comet['xcorr']
y = comet['stripped length']

a = comet['xcorr']
b = comet['ptm-total']

# evaluate a gaussian kernel density estimation (KDE) on a regular grid of nbins x nbins over data extents
nbins=300
k = kde.gaussian_kde([x,y])
xi, yi = np.mgrid[x.min():x.max():nbins*1j, y.min():y.max():nbins*1j]
zi = k(np.vstack([xi.flatten(), yi.flatten()]))

c = kde.gaussian_kde([a,b])
ai, bi = np.mgrid[a.min():a.max():nbins*1j, b.min():b.max():nbins*1j]
di = k(np.vstack([ai.flatten(), bi.flatten()]))

# density plot of length vs xcorr
plt.subplot(221)
plt.pcolormesh(xi, yi, zi.reshape(xi.shape))
plt.title('logit')
plt.colorbar()

# density plot of xcorr vs ptms
plt.subplot(222)
plt.pcolormesh(ai, bi, di.reshape(ai.shape))
plt.title('Combined MED4 Comet XCorr vs total PTMs')
plt.colorbar()


plt.show()

### XCorr > 2.5 - Visualizing the results:

In [None]:
# making evenly spaced bins for the Xcorr data based on the min and max, called above
bins = [2.5, 3, 3.5, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5, 8, 8.5, 9]
labels = ['2.5-3', '3-3.5', '3.5-4', '4-4.5', '4.5-5', '5-5.5', '5.5-6', '6-6.5', '6.5-7', '7-7.5', '7.5-8', '8-8.5', '8.5-9']

# use pandas cut function to do the binning itself
comet25['binned'] = pd.cut(comet25['xcorr'], bins=bins, labels=labels)

# bar plots of binned PTM data

index = ['2.5-3', '3-3.5', '3.5-4', '4-4.5', '4.5-5', '5-5.5', '5.5-6', '6-6.5', '6.5-7', '7-7.5', '7.5-8', '8-8.5', '8.5-9']
data = {'Total PTMs': [comet25.groupby('binned')['ptm-total'].sum()['2.5-3'], comet25.groupby('binned')['ptm-total'].sum()['3-3.5'], comet25.groupby('binned')['ptm-total'].sum()['3.5-4'], comet25.groupby('binned')['ptm-total'].sum()['4-4.5'], comet25.groupby('binned')['ptm-total'].sum()['4.5-5'], comet25.groupby('binned')['ptm-total'].sum()['5-5.5'], comet25.groupby('binned')['ptm-total'].sum()['5.5-6'], comet25.groupby('binned')['ptm-total'].sum()['6-6.5'], comet25.groupby('binned')['ptm-total'].sum()['6.5-7'], comet25.groupby('binned')['ptm-total'].sum()['7-7.5'], comet25.groupby('binned')['ptm-total'].sum()['7.5-8'], comet25.groupby('binned')['ptm-total'].sum()['8-8.5'], comet25.groupby('binned')['ptm-total'].sum()['8.5-9']],
        'Cys carb.': [comet25.groupby('binned')['c-carb'].sum()['2.5-3'], comet25.groupby('binned')['c-carb'].sum()['3-3.5'], comet25.groupby('binned')['c-carb'].sum()['3.5-4'], comet25.groupby('binned')['c-carb'].sum()['4-4.5'], comet25.groupby('binned')['c-carb'].sum()['4.5-5'], comet25.groupby('binned')['c-carb'].sum()['5-5.5'], comet25.groupby('binned')['c-carb'].sum()['5.5-6'], comet25.groupby('binned')['c-carb'].sum()['6-6.5'], comet25.groupby('binned')['c-carb'].sum()['6.5-7'], comet25.groupby('binned')['c-carb'].sum()['7-7.5'], comet25.groupby('binned')['c-carb'].sum()['7.5-8'], comet25.groupby('binned')['c-carb'].sum()['8-8.5'], comet25.groupby('binned')['c-carb'].sum()['8.5-9']],
        'Met oxi.': [comet25.groupby('binned')['m-oxid'].sum()['2.5-3'], comet25.groupby('binned')['m-oxid'].sum()['3-3.5'], comet25.groupby('binned')['m-oxid'].sum()['3.5-4'], comet25.groupby('binned')['m-oxid'].sum()['4-4.5'], comet25.groupby('binned')['m-oxid'].sum()['4.5-5'], comet25.groupby('binned')['m-oxid'].sum()['5-5.5'], comet25.groupby('binned')['m-oxid'].sum()['5.5-6'], comet25.groupby('binned')['m-oxid'].sum()['6-6.5'], comet25.groupby('binned')['m-oxid'].sum()['6.5-7'], comet25.groupby('binned')['m-oxid'].sum()['7-7.5'], comet25.groupby('binned')['m-oxid'].sum()['7.5-8'], comet25.groupby('binned')['m-oxid'].sum()['8-8.5'], comet25.groupby('binned')['m-oxid'].sum()['8.5-9']],
        'Asp deam.': [comet25.groupby('binned')['n-deam'].sum()['2.5-3'], comet25.groupby('binned')['n-deam'].sum()['3-3.5'], comet25.groupby('binned')['n-deam'].sum()['3.5-4'], comet25.groupby('binned')['n-deam'].sum()['4-4.5'], comet25.groupby('binned')['n-deam'].sum()['4.5-5'], comet25.groupby('binned')['n-deam'].sum()['5-5.5'], comet25.groupby('binned')['n-deam'].sum()['5.5-6'], comet25.groupby('binned')['n-deam'].sum()['6-6.5'], comet25.groupby('binned')['n-deam'].sum()['6.5-7'], comet25.groupby('binned')['n-deam'].sum()['7-7.5'], comet25.groupby('binned')['n-deam'].sum()['7.5-8'], comet25.groupby('binned')['n-deam'].sum()['8-8.5'], comet25.groupby('binned')['n-deam'].sum()['8.5-9']],
        'Glut deam.': [comet25.groupby('binned')['q-deam'].sum()['2.5-3'], comet25.groupby('binned')['q-deam'].sum()['3-3.5'], comet25.groupby('binned')['q-deam'].sum()['3.5-4'], comet25.groupby('binned')['q-deam'].sum()['4-4.5'], comet25.groupby('binned')['q-deam'].sum()['4.5-5'], comet25.groupby('binned')['q-deam'].sum()['5-5.5'], comet25.groupby('binned')['q-deam'].sum()['5.5-6'], comet25.groupby('binned')['q-deam'].sum()['6-6.5'], comet25.groupby('binned')['q-deam'].sum()['6.5-7'], comet25.groupby('binned')['q-deam'].sum()['7-7.5'], comet25.groupby('binned')['q-deam'].sum()['7.5-8'], comet25.groupby('binned')['q-deam'].sum()['8-8.5'], comet25.groupby('binned')['q-deam'].sum()['8.5-9']],
        'Lys iron': [comet25.groupby('binned')['k-iron'].sum()['2.5-3'], comet25.groupby('binned')['k-iron'].sum()['3-3.5'], comet25.groupby('binned')['k-iron'].sum()['3.5-4'], comet25.groupby('binned')['k-iron'].sum()['4-4.5'], comet25.groupby('binned')['k-iron'].sum()['4.5-5'], comet25.groupby('binned')['k-iron'].sum()['5-5.5'], comet25.groupby('binned')['k-iron'].sum()['5.5-6'], comet25.groupby('binned')['k-iron'].sum()['6-6.5'], comet25.groupby('binned')['k-iron'].sum()['6.5-7'], comet25.groupby('binned')['k-iron'].sum()['7-7.5'], comet25.groupby('binned')['k-iron'].sum()['7.5-8'], comet25.groupby('binned')['k-iron'].sum()['8-8.5'], comet25.groupby('binned')['k-iron'].sum()['8.5-9']],
        'Lys meth.': [comet25.groupby('binned')['k-meth'].sum()['2.5-3'], comet25.groupby('binned')['k-meth'].sum()['3-3.5'], comet25.groupby('binned')['k-meth'].sum()['3.5-4'], comet25.groupby('binned')['k-meth'].sum()['4-4.5'], comet25.groupby('binned')['k-meth'].sum()['4.5-5'], comet25.groupby('binned')['k-meth'].sum()['5-5.5'], comet25.groupby('binned')['k-meth'].sum()['5.5-6'], comet25.groupby('binned')['k-meth'].sum()['6-6.5'], comet25.groupby('binned')['k-meth'].sum()['6.5-7'], comet25.groupby('binned')['k-meth'].sum()['7-7.5'], comet25.groupby('binned')['k-meth'].sum()['7.5-8'], comet25.groupby('binned')['k-meth'].sum()['8-8.5'], comet25.groupby('binned')['k-meth'].sum()['8.5-9']],
        'Arg meth.': [comet25.groupby('binned')['r-meth'].sum()['2.5-3'], comet25.groupby('binned')['r-meth'].sum()['3-3.5'], comet25.groupby('binned')['r-meth'].sum()['3.5-4'], comet25.groupby('binned')['r-meth'].sum()['4-4.5'], comet25.groupby('binned')['r-meth'].sum()['4.5-5'], comet25.groupby('binned')['r-meth'].sum()['5-5.5'], comet25.groupby('binned')['r-meth'].sum()['5.5-6'], comet25.groupby('binned')['r-meth'].sum()['6-6.5'], comet25.groupby('binned')['r-meth'].sum()['6.5-7'], comet25.groupby('binned')['r-meth'].sum()['7-7.5'], comet25.groupby('binned')['r-meth'].sum()['7.5-8'], comet25.groupby('binned')['r-meth'].sum()['8-8.5'], comet25.groupby('binned')['r-meth'].sum()['8.5-9']]
        }

comet25bin = pd.DataFrame(data, columns=['Total PTMs','Cys carb.','Met oxi.','Asp deam.', 'Glut deam.', 'Lys iron', 'Lys meth.', 'Arg meth.'], index=index)

ax1 = comet25bin.plot.bar(y='Total PTMs', rot=45)
ax1.set_title('Total PTMs')

ax2 = comet25bin.plot.bar(y='Cys carb.', rot=45)
ax2.set_title('Cysteine carbamidomethylation')

ax3 = comet25bin.plot.bar(y='Met oxi.', rot=45)
ax3.set_title('Methionine oxidation')

ax4 = comet25bin.plot.bar(y='Asp deam.', rot=45)
ax4.set_title('Asparagine deamidation')

ax5 = comet25bin.plot.bar(y='Glut deam.', rot=45)
ax5.set_title('Glutamine deamidation')

ax6 = comet25bin.plot.bar(y='Lys iron', rot=45)
ax6.set_title('Lysine iron adduct')

ax7 = comet25bin.plot.bar(y='Lys meth.', rot=45)
ax7.set_title('Lysine methylation')

ax7 = comet25bin.plot.bar(y='Arg meth.', rot=45)
ax7.set_title('Arginine methylation')

#plt.savefig('/home/millieginty/Documents/git-repos/2017-etnp/analyses/pronovo-2020/pronovo-2020-ptm/MED4_trypsin1_comet25_PTMopt.png')

In [None]:
# histogram of stripped peptide lengths

#plt.rcdefaults()
#fig, ax = plt.subplots()

comet25.plot(y='stripped length', kind = 'hist', bins = 20, title = 'Peptide length')
plt

In [None]:
# histogram of total peptide amounts

#plt.rcdefaults()
#fig, ax = plt.subplots()

comet25.plot(y='ptm-total', kind = 'hist', bins = 20, title = 'PTMs/sequence')
plt

In [None]:
from matplotlib import pyplot
# density plot of xcorr vs length (idea from https://python-graph-gallery.com/85-density-plot-with-matplotlib/)

# read in data
x = comet25['xcorr']
y = comet25['stripped length']
 
# evaluate a gaussian kernel density estimation (KDE) on a regular grid of nbins x nbins over data extents
nbins=300
k = kde.gaussian_kde([x,y])
xi, yi = np.mgrid[x.min():x.max():nbins*1j, y.min():y.max():nbins*1j]
zi = k(np.vstack([xi.flatten(), yi.flatten()]))
 
# make the plot
plt.pcolormesh(xi, yi, zi.reshape(xi.shape))
plt.colorbar()
plt.show() 

In [None]:
# now we have the stripped peptide csvs and txt files in the same data dir:
!ls /home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/RAL4_95_MED2_trypsin/TPP/