### Manipulation of Trans Proteomic Pipeline (TPP) PeptideProphet peptide validation results from Comet database searched*Prochlorococus MED4* LC-MS/MS data using python.

Starting with: 

- PeptideProphet output (.xlxs and .csv) of PTM-optimized database searches >90% probability

Goal:

- Files with stripped (no PTMs or tryptic ends) peptide lists and
- Columns with #'s of each modification in every sequence
- Column with stripped peptide lengths (# amino acids)

For technical duplicates, I exported PeptideProphet results as both Excel files and as CSVs into my ETNP 2017 git repo:

In [2]:
cd /home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/RAL4_95_MED2_trypsin/TPP/

/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/RAL4_95_MED2_trypsin/TPP


In [3]:
ls

 RAL4_MED2_combine_Comet25_AA_NAAF.csv
 RAL4_MED2_combine_Comet2.5Xcorr_proteins.txt
 RAL4_MED2_combine_Comet3_AA_NAAF.csv
 RAL4_MED2_combine_Comet3Xcorr_proteins.txt
 RAL4_MED2_trypsin_1_PTMopt_Comet.csv
 RAL4_MED2_trypsin_1_PTMopt_Comet_stripped.csv
 RAL4_MED2_trypsin_1_PTMopt_Comet_stripped_peptides_2.5XCorr.txt
 RAL4_MED2_trypsin_1_PTMopt_Comet_stripped_peptides.txt
 RAL4_MED2_trypsin_1_PTMopt_Comet_stripped_work.ods
 RAL4_MED2_trypsin_1_PTMopt_Comet_unfiltered.csv
 RAL4_MED2_trypsin_1_PTMopt_Comet.xlsx
 RAL4_MED2_trypsin_1_PTMopt_PepProp90.csv
 RAL4_MED2_trypsin_1_PTMopt_PepProp90_stripped.csv
 RAL4_MED2_trypsin_1_PTMopt_PepProp90_stripped_peptides
 RAL4_MED2_trypsin_1_PTMopt_PepProp90.xlsx
 RAL4_MED2_trypsin_2_PTMopt_Comet.csv
 RAL4_MED2_trypsin_2_PTMopt_Comet_stripped.csv
 RAL4_MED2_trypsin_2_PTMopt_Comet_stripped_peptides_2.5XCorr.txt
 RAL4_MED2_trypsin_2_PTMopt_Comet_stripped_peptides.txt
 RAL4_MED2_trypsin_2_PTMopt_Comet_stripped_work.ods
 RAL4_MED2_trypsi

In [4]:
#import pandas library for working with tabular data
import os
os.getcwd()
import pandas as pd
#import regular expresson (regex)
import re
#check pandas version
pd.__version__

'1.0.5'

In [15]:
#read the CSV into a datadrame we name 'comet' using the pandas read_csv function
peppro1 = pd.read_csv("/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/RAL4_95_MED2_trypsin/TPP/RAL4_MED2_trypsin_1_PTMopt_PepProp90.csv")
peppro2 = pd.read_csv("/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/RAL4_95_MED2_trypsin/TPP/RAL4_MED2_trypsin_2_PTMopt_PepProp90.csv")

print(len(peppro1))
print(len(peppro2))

#print(peppro1.columns)

frames = [peppro1, peppro2]

# concatenate dataframes
peppro = pd.concat(frames, sort=False)
print(len(peppro))

#look at the dataframe
peppro.head()

20421
20179
40600


Unnamed: 0,probability,spectrum,expect,ions,peptide,protein,calc_neutral_pep_mass
0,1.0,022016_RAL4_95_MED2_trypsin_1.04846.04846.2,0.000252,18-Jul,K.VGAATETEM[147.04]K.Y,"PMM0452,PMM1436",1051.4856
1,1.0,022016_RAL4_95_MED2_trypsin_1.05313.05313.3,0.000603,Aug-48,K.VEAHPIPEHPRPR.R,PMM0760,1533.8164
2,1.0,022016_RAL4_95_MED2_trypsin_1.05371.05371.3,3.79e-07,17/48,K.FHSAEVDSETDHR.V,PMM0613,1528.6542
3,1.0,022016_RAL4_95_MED2_trypsin_1.05377.05377.2,3.59e-07,16-Sep,K.YC[160.03]DDAINKR.E,PMM0574,1153.5186
4,1.0,022016_RAL4_95_MED2_trypsin_1.05457.05457.4,0.00607,Jul-90,R.HGGGAFSGKDPTKVDR.S,PMM0311,1627.8067


The peptide column has the residues before and after the tryptic terminii as well as masses of modified residues (e.g., 160.03 Da for carbamidomethylated cysteine). We want to make new columns with all that information and make a column with only the 'stripped' peptide sequence that's just amino acids - this we can then align against other sequences, for example.

Modified residues were allowed for:

    fixed carbamodimethylation of cysteine 57.021464 C
    varialbe oxidation of methionine: 15.9949 M
    variable deamidation of asparagine, glumatine: 0.984016 NQ
    variable iron cation on lysine: 54.010565 K
    variable methylation of lysine and arginine: 14.015650 KR

We'll then write this manipulated dataframe to a new file.

In [16]:
# use str.strip with indexing by str[0] to add a column with the peptide's left terminus
peppro['L terminus'] = peppro['peptide'].astype(str).str[0]

# use str.strip with indexing by str[-1] to add a column with the peptide's left terminus
peppro['R terminus'] = peppro['peptide'].str.strip().str[-1]

# use a count function to enumerate the # of ccarbamidomethylated C's in each peptide
peppro['c-carb'] = peppro['peptide'].str.count("160.03")

# use a count function to enumerate the # of oxidized M's in each peptide
peppro['m-oxid'] = peppro['peptide'].str.count("147.04")

# use a count function to enumerate the # of deamidated N's in each peptide
peppro['n-deam'] = peppro['peptide'].str.count("115.03")

# use a count function to enumerate the # of deamidated Q's in each peptide
peppro['q-deam'] = peppro['peptide'].str.count("129.04")

# use a count function to enumerate the # of iron adducted K's in each peptide
peppro['k-iron'] = peppro['peptide'].str.count("182.11")

# use a count function to enumerate the # of methylated K's in each peptide
peppro['k-meth'] = peppro['peptide'].str.count("142.11")

# use a count function to enumerate the # of methylated R's in each peptide
peppro['r-meth'] = peppro['peptide'].str.count("170.12")

# create a column with 'stripped' peptide sequences using strip
peppro['stripped peptide'] = peppro['peptide'].str[2:].str[:-2].str.replace(r"\[.*\]","")

# add a column with the stripped peptide length (number of AAs)
peppro['stripped length'] = peppro['stripped peptide'].apply(len)

# write modified dataframe to new txt file, same name + 'stripped'
peppro.to_csv("/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/RAL4_95_MED2_trypsin/TPP/RAL4_MED2_trypsin_combine_PTMopt_PepProp90_stripped.csv")

# check out the results
peppro.head()

Unnamed: 0,probability,spectrum,expect,ions,peptide,protein,calc_neutral_pep_mass,L terminus,R terminus,c-carb,m-oxid,n-deam,q-deam,k-iron,k-meth,r-meth,stripped peptide,stripped length
0,1.0,022016_RAL4_95_MED2_trypsin_1.04846.04846.2,0.000252,18-Jul,K.VGAATETEM[147.04]K.Y,"PMM0452,PMM1436",1051.4856,K,Y,0,1,0,0,0,0,0,VGAATETEMK,10
1,1.0,022016_RAL4_95_MED2_trypsin_1.05313.05313.3,0.000603,Aug-48,K.VEAHPIPEHPRPR.R,PMM0760,1533.8164,K,R,0,0,0,0,0,0,0,VEAHPIPEHPRPR,13
2,1.0,022016_RAL4_95_MED2_trypsin_1.05371.05371.3,3.79e-07,17/48,K.FHSAEVDSETDHR.V,PMM0613,1528.6542,K,V,0,0,0,0,0,0,0,FHSAEVDSETDHR,13
3,1.0,022016_RAL4_95_MED2_trypsin_1.05377.05377.2,3.59e-07,16-Sep,K.YC[160.03]DDAINKR.E,PMM0574,1153.5186,K,E,1,0,0,0,0,0,0,YCDDAINKR,9
4,1.0,022016_RAL4_95_MED2_trypsin_1.05457.05457.4,0.00607,Jul-90,R.HGGGAFSGKDPTKVDR.S,PMM0311,1627.8067,R,S,0,0,0,0,0,0,0,HGGGAFSGKDPTKVDR,16


In [17]:
# keep only peptides  >95% probability
# need to convert Xcorr column from strings to numeric so we can use loc
peppro['probability'] = pd.to_numeric(peppro['probability'])

peppro95 = peppro.loc[peppro['probability'] >= 0.95]

# keep only peptide column 
pep95 = peppro95[["stripped peptide"]]

# write altered dataframe to new txt file
# used header and index parameters to get rid of 'Peptide' header and the indexing

pep95.to_csv("/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/RAL4_95_MED2_trypsin/TPP/RAL4_MED2_trypsin_combine_PTMopt_PepPro_stripped_peptides_95.txt", header=False, index=False)

# removing redundancy
pep95dedup = pd.DataFrame.drop_duplicates(pep95)

print("# redundant peppro peptides >2.5 XCorr", len(pep95))
print("# nonredundant peppro peptides >2.5 XCOrr", len(pep95dedup))

pep95.head()

# redundant peppro peptides >2.5 XCorr 38886
# nonredundant peppro peptides >2.5 XCOrr 12285


Unnamed: 0,stripped peptide
0,VGAATETEMK
1,VEAHPIPEHPRPR
2,FHSAEVDSETDHR
3,YCDDAINKR
4,HGGGAFSGKDPTKVDR
