### Manipulation of Trans-Proteomic Pipeline (TPP) PeptideProphet peptide validation results from Comet database-searched *Prochlorococcus MED4* LC-MS/MS data using Python.

Starting with: 

- PeptideProphet output (.xlsx and .csv) of PTM-optimized database searches, >90% probability

Goal:

- Files with stripped peptide lists (no PTMs or tryptic ends)
- Columns with the number of each modification in every sequence
- A column with stripped peptide lengths (number of amino acids)

For technical duplicates, I exported PeptideProphet results as both Excel files and as CSVs into my ETNP 2017 git repo:

In [1]:
cd /home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/RAL4_95_MED2_trypsin/TPP/

/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/RAL4_95_MED2_trypsin/TPP


In [2]:
ls

RAL4_MED2_trypsin_1_PTMopt_Comet.csv
RAL4_MED2_trypsin_1_PTMopt_Comet_stripped.csv
RAL4_MED2_trypsin_1_PTMopt_Comet.xlsx
RAL4_MED2_trypsin_1_PTMopt_PepProp90.csv
RAL4_MED2_trypsin_1_PTMopt_PepProp90.xlsx
RAL4_MED2_trypsin_2_PTMopt_Comet.csv
RAL4_MED2_trypsin_2_PTMopt_Comet_stripped.csv
RAL4_MED2_trypsin_2_PTMopt_Comet.xlsx
RAL4_MED2_trypsin_2_PTMopt_PepProp90.csv
RAL4_MED2_trypsin_2_PTMopt_PepProp90.xlsx


In [3]:
# import os for working with paths and the current directory
import os
os.getcwd()
# import the pandas library for working with tabular data
import pandas as pd
# import the regular expression (regex) library
import re
# check the pandas version
pd.__version__

'1.0.5'

In [4]:
# read the CSV into a dataframe we name 'comet' using the pandas read_csv function
comet = pd.read_csv("/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/RAL4_95_MED2_trypsin/TPP/RAL4_MED2_trypsin_1_PTMopt_PepProp90.csv")


#look at the dataframe
comet.head()

Unnamed: 0,probability,spectrum,expect,ions,peptide,protein,calc_neutral_pep_mass
0,1.0,022016_RAL4_95_MED2_trypsin_1.04846.04846.2,0.000252,18-Jul,K.VGAATETEM[147.04]K.Y,"PMM0452,PMM1436",1051.4856
1,1.0,022016_RAL4_95_MED2_trypsin_1.05313.05313.3,0.000603,Aug-48,K.VEAHPIPEHPRPR.R,PMM0760,1533.8164
2,1.0,022016_RAL4_95_MED2_trypsin_1.05371.05371.3,3.79e-07,17/48,K.FHSAEVDSETDHR.V,PMM0613,1528.6542
3,1.0,022016_RAL4_95_MED2_trypsin_1.05377.05377.2,3.59e-07,16-Sep,K.YC[160.03]DDAINKR.E,PMM0574,1153.5186
4,1.0,022016_RAL4_95_MED2_trypsin_1.05457.05457.4,0.00607,Jul-90,R.HGGGAFSGKDPTKVDR.S,PMM0311,1627.8067


The peptide column has the residues before and after the tryptic termini as well as the masses of modified residues (e.g., 160.03 Da for carbamidomethylated cysteine). We want to make new columns with all that information and a column with only the 'stripped' peptide sequence, i.e. just the amino acids, which we can then align against other sequences, for example.
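To make the layout of a peptide entry concrete, here is a minimal sketch that splits one string from the table above into its three parts. The pattern and the helper name `split_peptide` are illustrative assumptions, not part of the notebook; the sketch assumes every entry follows the `X.SEQUENCE.Y` layout seen in the output.

```python
import re

# Assumed layout: one flanking residue, a dot, the peptide body
# (possibly with bracketed masses), a dot, one flanking residue.
PEPTIDE_RE = re.compile(r"^(?P<left>.)\.(?P<body>.+)\.(?P<right>.)$")

def split_peptide(peptide):
    """Split 'X.SEQUENCE.Y' into (left residue, body, right residue)."""
    match = PEPTIDE_RE.match(peptide)
    if match is None:
        raise ValueError(f"unexpected peptide format: {peptide!r}")
    return match.group("left"), match.group("body"), match.group("right")

left, body, right = split_peptide("K.YC[160.03]DDAINKR.E")
print(left, body, right)
```

The body still contains the bracketed mass at this point; removing it is the 'stripping' step.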

Modified residues were allowed for:

    fixed carbamidomethylation of cysteine: 57.021464 C
    variable oxidation of methionine: 15.9949 M
    variable deamidation of asparagine, glutamine: 0.984016 NQ
    variable iron cation on lysine: 54.010565 K
    variable methylation of lysine and arginine: 14.015650 KR

We'll then write this manipulated dataframe to a new file.
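The bracketed masses in the peptide column are the residue mass plus the mass shift listed above, rounded to two decimals. The sketch below reproduces that arithmetic. The monoisotopic residue masses are standard reference values and are an assumption here, since the notebook does not state them; the column labels are the ones used in the cells that follow.

```python
# Monoisotopic residue masses in Da (standard reference values; an assumption,
# they are not given in this notebook).
RESIDUE_MASS = {
    "C": 103.00919,
    "M": 131.04049,
    "N": 114.04293,
    "Q": 128.05858,
    "K": 128.09496,
    "R": 156.10111,
}

# (residue, mass shift in Da, column label) from the search settings listed above.
MODIFICATIONS = [
    ("C", 57.021464, "c-carb"),
    ("M", 15.9949, "m-oxid"),
    ("N", 0.984016, "n-deam"),
    ("Q", 0.984016, "q-deam"),
    ("K", 54.010565, "k-iron"),
    ("K", 14.015650, "k-meth"),
    ("R", 14.015650, "r-meth"),
]

def bracket_mass(residue, delta):
    """Return the modified residue mass as a two-decimal string."""
    return f"{RESIDUE_MASS[residue] + delta:.2f}"

MOD_TAGS = {label: bracket_mass(res, delta) for res, delta, label in MODIFICATIONS}
for label, tag in MOD_TAGS.items():
    print(label, tag)
```

These two-decimal strings are the values counted in each peptide below.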

In [5]:
# index with str[0] to add a column with the residue flanking the peptide's left terminus
comet['L terminus'] = comet['peptide'].astype(str).str[0]

# use str.strip with indexing by str[-1] to add a column with the residue flanking the peptide's right terminus
comet['R terminus'] = comet['peptide'].str.strip().str[-1]

# use a count function to enumerate the # of carbamidomethylated C's in each peptide
comet['c-carb'] = comet['peptide'].str.count("160.03")

# use a count function to enumerate the # of oxidized M's in each peptide
comet['m-oxid'] = comet['peptide'].str.count("147.04")

# use a count function to enumerate the # of deamidated N's in each peptide
comet['n-deam'] = comet['peptide'].str.count("115.03")

# use a count function to enumerate the # of deamidated Q's in each peptide
comet['q-deam'] = comet['peptide'].str.count("129.04")

# use a count function to enumerate the # of iron adducted K's in each peptide
comet['k-iron'] = comet['peptide'].str.count("182.11")

# use a count function to enumerate the # of methylated K's in each peptide
comet['k-meth'] = comet['peptide'].str.count("142.11")

# use a count function to enumerate the # of methylated R's in each peptide
comet['r-meth'] = comet['peptide'].str.count("170.12")

# create a column with 'stripped' peptide sequences: slice off the flanking residues and dots, then remove the bracketed masses with a regex
comet['stripped peptide'] = comet['peptide'].str[2:].str[:-2].str.replace(r"\[.*\]","", regex=True)

# add a column with the stripped peptide length (number of AAs)
comet['stripped length'] = comet['stripped peptide'].apply(len)

# write modified dataframe to a new CSV file, same name + 'stripped'
comet.to_csv("/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/RAL4_95_MED2_trypsin/TPP/RAL4_MED2_trypsin_1_PTMopt_PepProp90_stripped.csv")


# check out the results
comet.head()

Unnamed: 0,probability,spectrum,expect,ions,peptide,protein,calc_neutral_pep_mass,L terminus,R terminus,c-carb,m-oxid,n-deam,q-deam,k-iron,k-meth,r-meth,stripped peptide,stripped length
0,1.0,022016_RAL4_95_MED2_trypsin_1.04846.04846.2,0.000252,18-Jul,K.VGAATETEM[147.04]K.Y,"PMM0452,PMM1436",1051.4856,K,Y,0,1,0,0,0,0,0,VGAATETEMK,10
1,1.0,022016_RAL4_95_MED2_trypsin_1.05313.05313.3,0.000603,Aug-48,K.VEAHPIPEHPRPR.R,PMM0760,1533.8164,K,R,0,0,0,0,0,0,0,VEAHPIPEHPRPR,13
2,1.0,022016_RAL4_95_MED2_trypsin_1.05371.05371.3,3.79e-07,17/48,K.FHSAEVDSETDHR.V,PMM0613,1528.6542,K,V,0,0,0,0,0,0,0,FHSAEVDSETDHR,13
3,1.0,022016_RAL4_95_MED2_trypsin_1.05377.05377.2,3.59e-07,16-Sep,K.YC[160.03]DDAINKR.E,PMM0574,1153.5186,K,E,1,0,0,0,0,0,0,YCDDAINKR,9
4,1.0,022016_RAL4_95_MED2_trypsin_1.05457.05457.4,0.00607,Jul-90,R.HGGGAFSGKDPTKVDR.S,PMM0311,1627.8067,R,S,0,0,0,0,0,0,0,HGGGAFSGKDPTKVDR,16


Now doing the same manipulation for the duplicate MED2_trypsin injection PepProp output:

In [6]:
# read the CSV into a dataframe we name 'comet' using the pandas read_csv function
comet = pd.read_csv("/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/RAL4_95_MED2_trypsin/TPP/RAL4_MED2_trypsin_2_PTMopt_PepProp90.csv")


#look at the dataframe
comet.head()

Unnamed: 0,probability,spectrum,expect,ions,peptide,protein,calc_neutral_pep_mass
0,1.0,022016_RAL4_95_MED2_trypsin_2.05324.05324.3,1e-06,17/48,K.FHSAEVDSETDHR.V,PMM0613,1528.6542
1,1.0,022016_RAL4_95_MED2_trypsin_2.05441.05441.3,0.00616,Jul-44,F.DIHTGDAEEATR.K,PMM1524,1313.5848
2,1.0,022016_RAL4_95_MED2_trypsin_2.05511.05511.3,0.000207,Aug-44,K.AETEDVKETEVK.E,PMM1402,1376.6671
3,1.0,022016_RAL4_95_MED2_trypsin_2.05565.05565.4,125.0,Jan-96,K.PEDC[160.03]N[115.03]EC[160.03]DGAM[147.04]S...,DECOY_PMM0901_UNMAPPED,2059.7339
4,1.0,022016_RAL4_95_MED2_trypsin_2.05593.05593.4,4.1e-05,9/120,M.SKRHPVVAVTGSSGAGTSTVK.R,PMM0785,2025.0967


In [7]:
# index with str[0] to add a column with the residue flanking the peptide's left terminus
comet['L terminus'] = comet['peptide'].astype(str).str[0]

# use str.strip with indexing by str[-1] to add a column with the residue flanking the peptide's right terminus
comet['R terminus'] = comet['peptide'].str.strip().str[-1]

# use a count function to enumerate the # of carbamidomethylated C's in each peptide
comet['c-carb'] = comet['peptide'].str.count("160.03")

# use a count function to enumerate the # of oxidized M's in each peptide
comet['m-oxid'] = comet['peptide'].str.count("147.04")

# use a count function to enumerate the # of deamidated N's in each peptide
comet['n-deam'] = comet['peptide'].str.count("115.03")

# use a count function to enumerate the # of deamidated Q's in each peptide
comet['q-deam'] = comet['peptide'].str.count("129.04")

# use a count function to enumerate the # of iron adducted K's in each peptide
comet['k-iron'] = comet['peptide'].str.count("182.11")

# use a count function to enumerate the # of methylated K's in each peptide
comet['k-meth'] = comet['peptide'].str.count("142.11")

# use a count function to enumerate the # of methylated R's in each peptide
comet['r-meth'] = comet['peptide'].str.count("170.12")

# create a column with 'stripped' peptide sequences: slice off the flanking residues and dots, then remove the bracketed masses with a regex
comet['stripped peptide'] = comet['peptide'].str[2:].str[:-2].str.replace(r"\[.*\]","", regex=True)

# add a column with the stripped peptide length (number of AAs)
comet['stripped length'] = comet['stripped peptide'].apply(len)

# write modified dataframe to a new CSV file, same name + 'stripped'
comet.to_csv("/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/RAL4_95_MED2_trypsin/TPP/RAL4_MED2_trypsin_2_PTMopt_PepProp90_stripped.csv")


# check out the results
comet.head()

Unnamed: 0,probability,spectrum,expect,ions,peptide,protein,calc_neutral_pep_mass,L terminus,R terminus,c-carb,m-oxid,n-deam,q-deam,k-iron,k-meth,r-meth,stripped peptide,stripped length
0,1.0,022016_RAL4_95_MED2_trypsin_2.05324.05324.3,1e-06,17/48,K.FHSAEVDSETDHR.V,PMM0613,1528.6542,K,V,0,0,0,0,0,0,0,FHSAEVDSETDHR,13
1,1.0,022016_RAL4_95_MED2_trypsin_2.05441.05441.3,0.00616,Jul-44,F.DIHTGDAEEATR.K,PMM1524,1313.5848,F,K,0,0,0,0,0,0,0,DIHTGDAEEATR,12
2,1.0,022016_RAL4_95_MED2_trypsin_2.05511.05511.3,0.000207,Aug-44,K.AETEDVKETEVK.E,PMM1402,1376.6671,K,E,0,0,0,0,0,0,0,AETEDVKETEVK,12
3,1.0,022016_RAL4_95_MED2_trypsin_2.05565.05565.4,125.0,Jan-96,K.PEDC[160.03]N[115.03]EC[160.03]DGAM[147.04]S...,DECOY_PMM0901_UNMAPPED,2059.7339,K,V,2,2,1,1,0,0,0,PEDC,4
4,1.0,022016_RAL4_95_MED2_trypsin_2.05593.05593.4,4.1e-05,9/120,M.SKRHPVVAVTGSSGAGTSTVK.R,PMM0785,2025.0967,M,R,0,0,0,0,0,0,0,SKRHPVVAVTGSSGAGTSTVK,21
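
In the output above, row 3 (the decoy hit carrying several bracketed masses) ends up with a stripped peptide of `PEDC` and a length of 4. That is what a greedy pattern such as `\[.*\]` produces when a peptide has more than one modification: it matches from the first `[` to the last `]` and removes the residues in between. A minimal sketch of the difference, using a hypothetical two-modification peptide (not taken from the data set, since the real one is truncated in the display):

```python
import re

# Hypothetical peptide body with two modifications, flanking residues
# already removed. This string is made up for illustration.
body = "AC[160.03]DM[147.04]K"

# Greedy: '.*' runs to the last ']' and swallows the residues between brackets.
greedy = re.sub(r"\[.*\]", "", body)

# Non-greedy: '.*?' stops at the first ']' so each bracket is removed separately.
lazy = re.sub(r"\[.*?\]", "", body)

print(greedy)  # ACK
print(lazy)    # ACDMK
```

Rows with zero or one modification are unaffected, which is why the other rows shown look as expected.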


Now we should have two stripped PeptideProphet CSVs in the directory:

In [8]:
ls /home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/RAL4_95_MED2_trypsin/TPP/

RAL4_MED2_trypsin_1_PTMopt_Comet.csv
RAL4_MED2_trypsin_1_PTMopt_Comet_stripped.csv
RAL4_MED2_trypsin_1_PTMopt_Comet.xlsx
RAL4_MED2_trypsin_1_PTMopt_PepProp90.csv
RAL4_MED2_trypsin_1_PTMopt_PepProp90_stripped.csv
RAL4_MED2_trypsin_1_PTMopt_PepProp90.xlsx
RAL4_MED2_trypsin_2_PTMopt_Comet.csv
RAL4_MED2_trypsin_2_PTMopt_Comet_stripped.csv
RAL4_MED2_trypsin_2_PTMopt_Comet.xlsx
RAL4_MED2_trypsin_2_PTMopt_PepProp90.csv
RAL4_MED2_trypsin_2_PTMopt_PepProp90_stripped.csv
RAL4_MED2_trypsin_2_PTMopt_PepProp90.xlsx
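
Since the same cell was run twice, once per injection, the steps could be wrapped in one function. The sketch below is one possible way to do that, not the notebook's own code: the name `strip_peptides` is hypothetical, it counts the bracketed form of each mass with the dot escaped, and it removes each bracket separately rather than with the greedy pattern used above. The second demo peptide is made up for illustration.

```python
import re
import pandas as pd

# Two-decimal masses as they appear in the peptide column, keyed by column name.
MOD_TAGS = {
    "c-carb": "160.03",
    "m-oxid": "147.04",
    "n-deam": "115.03",
    "q-deam": "129.04",
    "k-iron": "182.11",
    "k-meth": "142.11",
    "r-meth": "170.12",
}

def strip_peptides(df):
    """Return a copy of df with terminus, modification-count, and stripped columns."""
    out = df.copy()
    peptide = out["peptide"].astype(str).str.strip()
    out["L terminus"] = peptide.str[0]
    out["R terminus"] = peptide.str[-1]
    for column, mass in MOD_TAGS.items():
        # re.escape makes the brackets and the '.' match literally
        out[column] = peptide.str.count(re.escape(f"[{mass}]"))
    # drop 'X.' and '.Y', then remove each bracketed mass on its own
    out["stripped peptide"] = peptide.str[2:-2].str.replace(r"\[[^\]]*\]", "", regex=True)
    out["stripped length"] = out["stripped peptide"].str.len()
    return out

# Small in-memory demo; the second peptide is hypothetical.
demo = pd.DataFrame({"peptide": ["K.YC[160.03]DDAINKR.E", "K.AC[160.03]DM[147.04]K.R"]})
result = strip_peptides(demo)
print(result[["stripped peptide", "stripped length", "c-carb", "m-oxid"]])
```

To process both replicates, one could loop over the two `*_PepProp90.csv` paths, pass `pd.read_csv(path)` to the function, and write each result with `to_csv`.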
