### Manipulation of Peaks de novo results of Prochlorococcus MED4 LC-MS/MS data using Python

Starting with:

    Peaks de novo results (.csv) of PTM-optimized database searches

Goal:

    Files with stripped (no PTMs) peptide lists
    Columns with the number of each modification in every sequence
    Column with stripped peptide lengths (number of amino acids)

For technical duplicates, I exported PeaksDN search results CSVs into my ETNP 2017 git repo:

In [1]:
cd /home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/RAL4_95_MED2_trypsin/PeaksDN/

/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/RAL4_95_MED2_trypsin/PeaksDN


In [2]:
ls

RAL95_MED2_trypsin_1_PTMopt_DN50.csv
RAL95_MED2_trypsin_1_PTMopt_DN50_stripped.csv
RAL95_MED2_trypsin_1_PTMopt_DN50_stripped_peptides
RAL95_MED2_trypsin_1_PTMopt_DN80_stripped_peptides
RAL95_MED2_trypsin_2_PTMopt_DN50.csv


In [3]:
# import os to check the current working directory
import os
os.getcwd()
# import pandas library for working with tabular data
import pandas as pd
# import the regular expression (regex) library
import re
# check pandas version
pd.__version__

'1.0.5'

In [31]:
# read the CSV into a dataframe we name 'peaks' using the pandas read_csv function
peaks = pd.read_csv("/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/RAL4_95_MED2_trypsin/PeaksDN/RAL95_MED2_trypsin_1_PTMopt_DN50.csv")


#look at the dataframe
peaks.head()

Unnamed: 0,Fraction,Scan,Source File,Peptide,Tag Length,ALC (%),length,m/z,z,RT,Area,Mass,ppm,PTM,local confidence (%),tag (>=0%),mode
0,3,15768,022016_RAL4_95_MED2_trypsin_1.raw,KELN(+.98)LDTDLGK,11,98,11,623.8295,2,26.8,2620000.0,1245.6453,-0.6,Deamidation (NQ),98 100 99 99 99 99 99 99 100 99 99,KELN(+.98)LDTDLGK,CID
1,3,6171,022016_RAL4_95_MED2_trypsin_1.raw,KDLESLDSTNK,11,98,11,625.3166,2,12.87,4180000.0,1248.6196,-0.8,,98 100 100 100 99 99 99 97 98 99 99,KDLESLDSTNK,CID
2,3,46585,022016_RAL4_95_MED2_trypsin_1.raw,FFLLFK,6,98,6,407.7466,2,68.38,7850000.0,813.4788,-0.3,,96 98 99 100 99 99,FFLLFK,CID
3,3,28681,022016_RAL4_95_MED2_trypsin_1.raw,KLFTDYQELMK,11,98,11,708.3657,2,44.32,4690000.0,1414.7166,0.2,,99 100 99 99 99 96 94 99 99 99 99,KLFTDYQELMK,CID
4,3,39806,022016_RAL4_95_MED2_trypsin_1.raw,WALEELLNK,9,98,9,558.3083,2,59.1,43600000.0,1114.6023,-0.2,,98 99 100 100 99 97 98 97 98,WALEELLNK,CID


The Peptide column embeds the mass shift of each modification in parentheses after the modified residue (e.g., 57.02 Da for carbamidomethylation of cysteine). We want to count each modification in its own new column and add a column with only the 'stripped' peptide sequence (amino acids only), which we can then align against other sequences, for example.
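As a minimal illustration of what 'stripping' means here, the sketch below removes every parenthesized mass annotation from a single Peaks-style peptide string with the standard `re` module. The helper name `strip_ptms` is hypothetical, and the second example peptide is made up for illustration (it does not come from the dataset).

```python
import re

# Matches one parenthesized annotation such as "(+.98)" or "(+14.02)".
# [^)]* stops at the first ")" so two annotations are never merged into one match.
PTM_ANNOTATION = re.compile(r"\([^)]*\)")

def strip_ptms(peptide):
    """Return the peptide with all parenthesized mass annotations removed."""
    return PTM_ANNOTATION.sub("", peptide)

print(strip_ptms("KELN(+.98)LDTDLGK"))      # KELNLDTDLGK
print(strip_ptms("AC(+57.02)DM(+15.99)K"))  # ACDMK (made-up peptide with two annotations)
```

Matching each annotation separately matters for peptides with more than one modification: a greedy pattern such as `\(.*\)` would also delete the residues between the first and last annotation.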

Modified residues were allowed for:

    fixed carbamidomethylation of cysteine: 57.021464 C
    variable oxidation of methionine: 15.9949 M
    variable deamidation of asparagine, glutamine: 0.984016 NQ
    variable iron cation on lysine: 54.010565 K
    variable methylation of lysine and arginine: 14.015650 KR

We'll then write this manipulated dataframe to a new file.
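The counting and stripping steps can also be sketched as one table-driven helper. This is only a sketch under assumptions: the names `MOD_PATTERNS` and `annotate_peptides` are hypothetical, the `(+57.02)` and `(+15.99)` annotation strings are assumed by analogy with the `(+.98)` and `(+14.02)` annotations visible in the data, and the iron adduct is left out because its annotation string does not appear in the rows shown here.

```python
import pandas as pd

# Hypothetical lookup table: new column name -> regex for one annotated residue.
# "(", "+", "." and ")" are escaped so they match literally.
MOD_PATTERNS = {
    "c-carb": r"C\(\+57\.02\)",  # assumed annotation format
    "m-oxid": r"M\(\+15\.99\)",  # assumed annotation format
    "n-deam": r"N\(\+\.98\)",
    "q-deam": r"Q\(\+\.98\)",
    "k-meth": r"K\(\+14\.02\)",
    "r-meth": r"R\(\+14\.02\)",
}

def annotate_peptides(df, column="Peptide"):
    """Return a copy of df with modification counts and stripped sequences added."""
    out = df.copy()
    for name, pattern in MOD_PATTERNS.items():
        out[name] = out[column].str.count(pattern)
    # remove each parenthesized annotation separately
    out["stripped peptide"] = out[column].str.replace(r"\([^)]*\)", "", regex=True)
    out["stripped length"] = out["stripped peptide"].str.len()
    return out

demo = pd.DataFrame({"Peptide": ["KELN(+.98)LDTDLGK", "KLNER(+14.02)ETTLK", "FFLLFK"]})
print(annotate_peptides(demo)[["n-deam", "r-meth", "stripped peptide", "stripped length"]])
```

Keeping the patterns in one dictionary means a new modification only needs one new entry rather than another copied line.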

In [46]:
# str.count takes a regex, so literal '(', '+', '.' and ')' must be escaped

# use a count function to enumerate the # of carbamidomethylated C's in each peptide
peaks['c-carb'] = peaks['Peptide'].str.count(r"57\.02")

# use a count function to enumerate the # of oxidized M's in each peptide
peaks['m-oxid'] = peaks['Peptide'].str.count(r"15\.99")

# use a count function to enumerate the # of deamidated N's in each peptide
peaks['n-deam'] = peaks['Peptide'].str.count(r"N\(\+\.98\)")

# use a count function to enumerate the # of deamidated Q's in each peptide
peaks['q-deam'] = peaks['Peptide'].str.count(r"Q\(\+\.98\)")

# use a count function to enumerate the # of iron adducted K's in each peptide
peaks['k-iron'] = peaks['Peptide'].str.count(r"53\.92")

# use a count function to enumerate the # of methylated K's in each peptide
peaks['k-meth'] = peaks['Peptide'].str.count(r"K\(\+14\.02\)")

# use a count function to enumerate the # of methylated R's in each peptide
peaks['r-meth'] = peaks['Peptide'].str.count(r"R\(\+14\.02\)")

# create a column with 'stripped' peptide sequences using a regex replace
# [^)]* removes each annotation separately, so residues between two PTMs are kept
peaks['stripped peptide'] = peaks['Peptide'].str.replace(r"\([^)]*\)", "", regex=True)

# add a column with the stripped peptide length (number of AAs)
peaks['stripped length'] = peaks['stripped peptide'].apply(len)

# write modified dataframe to new CSV file, same name + 'stripped'
peaks.to_csv("/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/RAL4_95_MED2_trypsin/PeaksDN/RAL95_MED2_trypsin_1_PTMopt_DN50_stripped.csv")


# check out the results
peaks.head()

Unnamed: 0,Fraction,Scan,Source File,Peptide,Tag Length,ALC (%),length,m/z,z,RT,...,mode,c-carb,m-oxid,n-deam,q-deam,k-iron,k-meth,r-meth,stripped peptide,stripped length
0,4,16526,022016_RAL4_95_MED2_trypsin_2.raw,KLEEALQELK,10,98,10,600.8452,2,28.02,...,CID,0,0,0,0,0,0,0,KLEEALQELK,10
1,4,29615,022016_RAL4_95_MED2_trypsin_2.raw,SKDNLLSLLK,10,98,10,565.843,2,45.63,...,CID,0,0,0,0,0,0,0,SKDNLLSLLK,10
2,4,12960,022016_RAL4_95_MED2_trypsin_2.raw,KLNER(+14.02)ETTLK,10,98,10,623.3551,2,22.84,...,CID,0,0,0,0,0,0,1,KLNERETTLK,10
3,4,22861,022016_RAL4_95_MED2_trypsin_2.raw,KSLSTLLAMEYQDK,14,98,14,813.916,2,36.69,...,CID,0,0,0,0,0,0,0,KSLSTLLAMEYQDK,14
4,4,46354,022016_RAL4_95_MED2_trypsin_2.raw,FFLLFK,6,98,6,407.7462,2,68.1,...,CID,0,0,0,0,0,0,0,FFLLFK,6


In [47]:
# keep only the stripped peptide column (all peptides in the DN50 file)
pep = peaks[["stripped peptide"]]

# write the peptide list to a new txt file
# header=False and index=False drop the 'stripped peptide' header and the row index

pep.to_csv("/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/RAL4_95_MED2_trypsin/PeaksDN/RAL95_MED2_trypsin_1_PTMopt_DN50_stripped_peptides.txt", header=False, index=False)


# look
pep.head()

Unnamed: 0,stripped peptide
0,KLEEALQELK
1,SKDNLLSLLK
2,KLNERETTLK
3,KSLSTLLAMEYQDK
4,FFLLFK


In [50]:
# keep only peptides with ALC >= 80%
peaks80 = peaks.loc[peaks['ALC (%)'] >= 80]

# see how many rows and double check
# peaks80.head(-10)

# keep only the stripped peptide column
pep80 = peaks80[["stripped peptide"]]

# write the peptide list to a new txt file
# header=False and index=False drop the 'stripped peptide' header and the row index

pep80.to_csv("/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/RAL4_95_MED2_trypsin/PeaksDN/RAL95_MED2_trypsin_1_PTMopt_DN80_stripped_peptides.txt", header=False, index=False)


# look
pep80.head()

Unnamed: 0,stripped peptide
0,KLEEALQELK
1,SKDNLLSLLK
2,KLNERETTLK
3,KSLSTLLAMEYQDK
4,FFLLFK


In [51]:
# read the second replicate's CSV into the 'peaks' dataframe using the pandas read_csv function
peaks = pd.read_csv("/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/RAL4_95_MED2_trypsin/PeaksDN/RAL95_MED2_trypsin_2_PTMopt_DN50.csv")


#look at the dataframe
peaks.head()

Unnamed: 0,Fraction,Scan,Source File,Peptide,Tag Length,ALC (%),length,m/z,z,RT,Area,Mass,ppm,PTM,local confidence (%),tag (>=0%),mode
0,4,16526,022016_RAL4_95_MED2_trypsin_2.raw,KLEEALQELK,10,98,10,600.8452,2,28.02,17100000.0,1199.676,-0.1,,98 100 100 100 98 99 98 100 99 97,KLEEALQELK,CID
1,4,29615,022016_RAL4_95_MED2_trypsin_2.raw,SKDNLLSLLK,10,98,10,565.843,2,45.63,3250000.0,1129.6707,0.6,,97 99 99 98 99 99 99 100 100 98,SKDNLLSLLK,CID
2,4,12960,022016_RAL4_95_MED2_trypsin_2.raw,KLNER(+14.02)ETTLK,10,98,10,623.3551,2,22.84,13000000.0,1244.7087,-10.5,Methylation(KR),99 100 98 99 97 99 98 99 99 98,KLNER(+14.02)ETTLK,CID
3,4,22861,022016_RAL4_95_MED2_trypsin_2.raw,KSLSTLLAMEYQDK,14,98,14,813.916,2,36.69,2470000.0,1625.8335,-9.9,,97 99 100 98 98 99 100 99 99 100 100 97 98 97,KSLSTLLAMEYQDK,CID
4,4,46354,022016_RAL4_95_MED2_trypsin_2.raw,FFLLFK,6,98,6,407.7462,2,68.1,10100000.0,813.4788,-1.1,,96 98 99 100 99 99,FFLLFK,CID


In [52]:
# str.count takes a regex, so literal '(', '+', '.' and ')' must be escaped

# use a count function to enumerate the # of carbamidomethylated C's in each peptide
peaks['c-carb'] = peaks['Peptide'].str.count(r"57\.02")

# use a count function to enumerate the # of oxidized M's in each peptide
peaks['m-oxid'] = peaks['Peptide'].str.count(r"15\.99")

# use a count function to enumerate the # of deamidated N's in each peptide
peaks['n-deam'] = peaks['Peptide'].str.count(r"N\(\+\.98\)")

# use a count function to enumerate the # of deamidated Q's in each peptide
peaks['q-deam'] = peaks['Peptide'].str.count(r"Q\(\+\.98\)")

# use a count function to enumerate the # of iron adducted K's in each peptide
peaks['k-iron'] = peaks['Peptide'].str.count(r"53\.92")

# use a count function to enumerate the # of methylated K's in each peptide
peaks['k-meth'] = peaks['Peptide'].str.count(r"K\(\+14\.02\)")

# use a count function to enumerate the # of methylated R's in each peptide
peaks['r-meth'] = peaks['Peptide'].str.count(r"R\(\+14\.02\)")

# create a column with 'stripped' peptide sequences using a regex replace
# [^)]* removes each annotation separately, so residues between two PTMs are kept
peaks['stripped peptide'] = peaks['Peptide'].str.replace(r"\([^)]*\)", "", regex=True)

# add a column with the stripped peptide length (number of AAs)
peaks['stripped length'] = peaks['stripped peptide'].apply(len)

# write modified dataframe to new CSV file, same name + 'stripped'
peaks.to_csv("/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/RAL4_95_MED2_trypsin/PeaksDN/RAL95_MED2_trypsin_2_PTMopt_DN50_stripped.csv")


# check out the results
peaks.head()

Unnamed: 0,Fraction,Scan,Source File,Peptide,Tag Length,ALC (%),length,m/z,z,RT,...,mode,c-carb,m-oxid,n-deam,q-deam,k-iron,k-meth,r-meth,stripped peptide,stripped length
0,4,16526,022016_RAL4_95_MED2_trypsin_2.raw,KLEEALQELK,10,98,10,600.8452,2,28.02,...,CID,0,0,0,0,0,0,0,KLEEALQELK,10
1,4,29615,022016_RAL4_95_MED2_trypsin_2.raw,SKDNLLSLLK,10,98,10,565.843,2,45.63,...,CID,0,0,0,0,0,0,0,SKDNLLSLLK,10
2,4,12960,022016_RAL4_95_MED2_trypsin_2.raw,KLNER(+14.02)ETTLK,10,98,10,623.3551,2,22.84,...,CID,0,0,0,0,0,0,1,KLNERETTLK,10
3,4,22861,022016_RAL4_95_MED2_trypsin_2.raw,KSLSTLLAMEYQDK,14,98,14,813.916,2,36.69,...,CID,0,0,0,0,0,0,0,KSLSTLLAMEYQDK,14
4,4,46354,022016_RAL4_95_MED2_trypsin_2.raw,FFLLFK,6,98,6,407.7462,2,68.1,...,CID,0,0,0,0,0,0,0,FFLLFK,6


In [53]:
# keep only the stripped peptide column (all peptides in the DN50 file)
pep50 = peaks[["stripped peptide"]]

# write the peptide list to a new txt file
# header=False and index=False drop the 'stripped peptide' header and the row index

pep50.to_csv("/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/RAL4_95_MED2_trypsin/PeaksDN/RAL95_MED2_trypsin_2_PTMopt_DN50_stripped_peptides.txt", header=False, index=False)


# look
pep50.head()

Unnamed: 0,stripped peptide
0,KLEEALQELK
1,SKDNLLSLLK
2,KLNERETTLK
3,KSLSTLLAMEYQDK
4,FFLLFK


In [54]:
# keep only peptides with ALC >= 80%
peaks80 = peaks.loc[peaks['ALC (%)'] >= 80]

# see how many rows and double check
# peaks80.head(-10)

# keep only the stripped peptide column
pep80 = peaks80[["stripped peptide"]]

# write the peptide list to a new txt file
# header=False and index=False drop the 'stripped peptide' header and the row index

pep80.to_csv("/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/RAL4_95_MED2_trypsin/PeaksDN/RAL95_MED2_trypsin_2_PTMopt_DN80_stripped_peptides.txt", header=False, index=False)


# look
pep80.head()

Unnamed: 0,stripped peptide
0,KLEEALQELK
1,SKDNLLSLLK
2,KLNERETTLK
3,KSLSTLLAMEYQDK
4,FFLLFK


In [61]:
# now there are the original csvs,
# the versions of those with modification counts and stripped peptides added,
# and the txt files with only stripped peptides at the 50% and 80% ALC cutoffs
!ls /home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/RAL4_95_MED2_trypsin/PeaksDN/

RAL95_MED2_trypsin_1_PTMopt_DN50.csv
RAL95_MED2_trypsin_1_PTMopt_DN50_stripped.csv
RAL95_MED2_trypsin_1_PTMopt_DN50_stripped_peptides.txt
RAL95_MED2_trypsin_1_PTMopt_DN80_stripped_peptides.txt
RAL95_MED2_trypsin_2_PTMopt_DN50.csv
RAL95_MED2_trypsin_2_PTMopt_DN50_stripped.csv
RAL95_MED2_trypsin_2_PTMopt_DN50_stripped_peptides.txt
RAL95_MED2_trypsin_2_PTMopt_DN80_stripped_peptides.txt
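
Since the same steps were run by hand for each replicate and each ALC cutoff, they could be wrapped in one function and looped. The sketch below is one possible way to do that, not the workflow actually used above: the function name `write_stripped_peptides` and its `min_alc` parameter are hypothetical, and it assumes the input file names contain `DN50` as they do in this directory.

```python
from pathlib import Path

import pandas as pd

def write_stripped_peptides(csv_path, min_alc=50):
    """Write stripped peptides with ALC (%) >= min_alc to a txt file next to csv_path."""
    csv_path = Path(csv_path)
    peaks = pd.read_csv(csv_path)
    # remove each parenthesized mass annotation from the peptide sequences
    stripped = peaks["Peptide"].str.replace(r"\([^)]*\)", "", regex=True)
    kept = stripped[peaks["ALC (%)"] >= min_alc]
    # e.g. ..._DN50.csv -> ..._DN80_stripped_peptides.txt when min_alc=80
    out_name = csv_path.stem.replace("DN50", f"DN{min_alc}") + "_stripped_peptides.txt"
    out_path = csv_path.with_name(out_name)
    # one peptide per line, no header and no row index
    kept.to_csv(out_path, header=False, index=False)
    return out_path

# Usage, assuming the working directory holds the exported Peaks CSVs:
# for replicate in (1, 2):
#     for cutoff in (50, 80):
#         write_stripped_peptides(f"RAL95_MED2_trypsin_{replicate}_PTMopt_DN50.csv", cutoff)
```

Returning the output path makes it easy to confirm afterwards that all four txt files were written.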
