## Sequence to protein mapping with PepExplorer

### PepExplorer (Leprevost et al., 2014) is a sequence similarity-driven tool that takes the output of our de novo algorithm (PEAK), which contain candidate sequences with PTMs and confidence scores, and maps them to a user-defined target-decoy sequence database.

### PepExplorer is part of the Pattern Lab for Proteomics suite of tools available for free download [here](http://proteomics.fiocruz.br/software/pepexplorer/).

### I used PepExplorer 0.1.0.78 on my PC (x64 Windows 10). The parameters for searching de novo results (>50% ALC) were:

    - Min AAs in peptide: 5
    - Decoy method: reverse
    - PEAKS 8.0 parameters
    - MinIdent: 80%
    
### I exported the results of the alignments of each MED4 replicate vs. the MED4 sequence database (proteome from GenBank) and moved the .txt results to my local machine in this git directory. Using LibreOffice Calc I also took out only the protein IDs/# of alignments/spec counts/description and saved as a .csv:

In [1]:
# LIBRARIES
#import pandas library for working with tabular data
import os
os.getcwd()
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import kde
#import regular expresson (regex)
import re
#check pandas version
pd.__version__

'1.0.5'

In [2]:
cd /home/millieginty/Documents/git-repos/2017-etnp/analyses/pronovo-2020/pepexplorer/med4-PTMopt/

/home/millieginty/Documents/git-repos/2017-etnp/analyses/pronovo-2020/pepexplorer/med4-PTMopt


In [14]:
# combining and exporting results for PepExplorer results from de novo peptide > 80% ALC
# read the protein files into dataframes and combined the replicates

pe1 = pd.read_csv("/home/millieginty/Documents/git-repos/2017-etnp/analyses/pronovo-2020/pepexplorer/med4-PTMopt/RAL95_MED2_trypsin_1_PTMopt_DN80_PepExplorer-vs-MED4Graa_15ppm.csv")
pe2 = pd.read_csv("/home/millieginty/Documents/git-repos/2017-etnp/analyses/pronovo-2020/pepexplorer/med4-PTMopt/RAL95_MED2_trypsin_2_PTMopt_DN80_PepExplorer-vs-MED4Graa_15ppm.csv")

frames = [pe1, pe2]

print(pe1.columns)
print(pe2.columns)

pe50prot = pe1.merge(pe1,  on='ProteinID')

# removing redundancy
pe50protdd = pd.DataFrame.drop_duplicates(pe50prot)

# how many redundant proteins?
print("# redundant PeaksDN50 proteins = ", len(pe50prot))

# how many nonredundant proteins?
print("# nonredundant PeaksDN50 proteins = ", len(pe50protdd))

# export as a .txt file without headers
pe50protdd.to_csv("/home/millieginty/Documents/git-repos/2017-etnp/analyses/pronovo-2020/pepexplorer/med4-PTMopt/RAL4_MED2_combine_DN50PepEx_proteins.txt", header=False, index=False)

# export as a .txt file without headers
pe50.to_csv("/home/millieginty/Documents/git-repos/2017-etnp/analyses/pronovo-2020/pepexplorer/med4-PTMopt/RAL4_MED2_combine_DN50PepEx.csv")

# take a look
pe50.head()

Index(['ProteinID', 'Alignments', 'SpecCounts', 'Unique', 'Coverage',
       'Protein description'],
      dtype='object')
Index(['ProteinID', 'Alignments', 'SpecCounts', 'Unique', 'Coverage',
       'Protein description'],
      dtype='object')
# redundant PeaksDN50 proteins =  1068
# nonredundant PeaksDN50 proteins =  1068


Unnamed: 0,ProteinID,Alignments,SpecCounts,Unique,Coverage,Protein description
0,PMM0001,8,9,8,0.179,| dnaN | DNA polymerase III subunit beta8980.1...
1,PMM0002,2,2,1,0.038,| PMM0002 | hypothetical protein2210.038| PMM0...
2,PMM0003,8,8,8,0.092,| purL | phosphoribosylformylglycinamidine syn...
3,PMM0004,2,2,2,0.029,| purF | amidophosphoribosyltransferase2220.02...
4,PMM0005,9,9,9,0.098,| PMM0005 | DNA gyrase/topoisomerase IV9990.09...


In [6]:
# combining and exporting results for PepExplorer results from de novo peptide > 80% ALC
# read the protein files into dataframes and combined the replicates

pe1 = pd.read_csv("/home/millieginty/Documents/git-repos/2017-etnp/analyses/pronovo-2020/pepexplorer/med4-PTMopt/RAL95_MED2_trypsin_1_PTMopt_DN80_PepExplorer-vs-MED4Graa_15ppm.csv")
pe2 = pd.read_csv("/home/millieginty/Documents/git-repos/2017-etnp/analyses/pronovo-2020/pepexplorer/med4-PTMopt/RAL95_MED2_trypsin_2_PTMopt_DN80_PepExplorer-vs-MED4Graa_15ppm.csv")

frames = [pe1, pe2]

# concatenate dataframes
pe80 = pd.concat(frames, sort=False)

# let's also make a dataframe that's just the proteins
pe80prot = pe80[['ProteinID']].copy() 

# removing redundancy
pe80protdd = pd.DataFrame.drop_duplicates(pe80prot)

# how many redundant proteins?
print("# redundant PeaksDN80 proteins = ", len(pe80prot))

# how many nonredundant proteins?
print("# nonredundant PeaksDN80 proteins = ", len(pe80protdd))

# export as a .txt file without headers
pe80protdd.to_csv("/home/millieginty/Documents/git-repos/2017-etnp/analyses/pronovo-2020/pepexplorer/med4-PTMopt/RAL4_MED2_combine_PTM-opt_DN80_15ppm_PepEx_proteins.txt", header=False, index=False)

# export as a .txt file without headers
pe80.to_csv("/home/millieginty/Documents/git-repos/2017-etnp/analyses/pronovo-2020/pepexplorer/med4-PTMopt/RAL4_MED2_combine_PTM-opt_DN80_15ppm_PepEx.csv")

# take a look
pe80.head()

# redundant PeaksDN80 proteins =  2118
# nonredundant PeaksDN80 proteins =  1167


Unnamed: 0,ProteinID,Alignments,SpecCounts,Unique,Coverage,Protein description,Unnamed: 6
0,PMM0001,8,9,8,0.179,| dnaN | DNA polymerase III subunit beta8980.1...,
1,PMM0002,2,2,1,0.038,| PMM0002 | hypothetical protein2210.038| PMM0...,
2,PMM0003,8,8,8,0.092,| purL | phosphoribosylformylglycinamidine syn...,
3,PMM0004,2,2,2,0.029,| purF | amidophosphoribosyltransferase2220.02...,
4,PMM0005,9,9,9,0.098,| PMM0005 | DNA gyrase/topoisomerase IV9990.09...,


### Let's compare to our Comet results!

In [22]:
cd /home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/RAL4_95_MED2_trypsin/TPP/

/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/RAL4_95_MED2_trypsin/TPP


In [23]:
ls

 RAL4_MED2_combine_Comet2.5Xcorr_proteins.txt
 RAL4_MED2_combine_Comet3_AA_NAAF.csv
 RAL4_MED2_combine_Comet3Xcorr_proteins.txt
 RAL4_MED2_trypsin_1_PTMopt_Comet.csv
 RAL4_MED2_trypsin_1_PTMopt_Comet_stripped.csv
 RAL4_MED2_trypsin_1_PTMopt_Comet_stripped_peptides_2.5XCorr.txt
 RAL4_MED2_trypsin_1_PTMopt_Comet_stripped_peptides.txt
 RAL4_MED2_trypsin_1_PTMopt_Comet_stripped_work.ods
 RAL4_MED2_trypsin_1_PTMopt_Comet_unfiltered.csv
 RAL4_MED2_trypsin_1_PTMopt_Comet.xlsx
 RAL4_MED2_trypsin_1_PTMopt_PepProp90.csv
 RAL4_MED2_trypsin_1_PTMopt_PepProp90_stripped.csv
 RAL4_MED2_trypsin_1_PTMopt_PepProp90_stripped_peptides
 RAL4_MED2_trypsin_1_PTMopt_PepProp90.xlsx
 RAL4_MED2_trypsin_2_PTMopt_Comet.csv
 RAL4_MED2_trypsin_2_PTMopt_Comet_stripped.csv
 RAL4_MED2_trypsin_2_PTMopt_Comet_stripped_peptides_2.5XCorr.txt
 RAL4_MED2_trypsin_2_PTMopt_Comet_stripped_peptides.txt
 RAL4_MED2_trypsin_2_PTMopt_Comet_stripped_work.ods
 RAL4_MED2_trypsin_2_PTMopt_Comet.xlsx
 RAL4_MED2_trypsi

In [24]:
# export for XCorr >3
# read the CSVs of each replicate into a datadrame we name 'comet' using the pandas read_csv function
comet1 = pd.read_csv("/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/RAL4_95_MED2_trypsin/TPP/RAL4_MED2_trypsin_1_PTMopt_Comet.csv")
comet2 = pd.read_csv("/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/RAL4_95_MED2_trypsin/TPP/RAL4_MED2_trypsin_2_PTMopt_Comet.csv")

frames = [comet1, comet2]

# concatenate dataframes
cometall = pd.concat(frames, sort=False)

# get rid of these few weirdos
cometall = cometall[cometall.xcorr != '[unavailable]']

# let's only keep PSMs > XCorr 3 (see MED4 Comet notebook - this keeps an FDR < 1%)
# need to convert Xcorr column from strings to numeric so we can use loc
cometall['xcorr'] = pd.to_numeric(cometall['xcorr'])
comet3 = cometall.loc[cometall['xcorr'] >= 3]

# getting rid of any DECOY protein IDs
cometpmm3 = comet3[~comet3['protein'].str.contains("DECOY")]

# let's also make a dataframe that's just the proteins
cometprot = cometpmm3[['protein']].copy() 

# let's also deduplicate
cometprotdd = pd.DataFrame.drop_duplicates(cometprot)

print("# redundant Comet peptides", len(cometprot))
print("# nonredundant Comet peptides", len(cometprotdd))

# export as a .txt file without headers
cometprotdd.to_csv("/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/RAL4_95_MED2_trypsin/TPP/RAL4_MED2_combine_Comet3Xcorr_proteins.txt", header=False, index=False)

#look at the dataframe
cometprotdd.head()

# redundant Comet peptides 26923
# nonredundant Comet peptides 1294


Unnamed: 0,protein
0,PMM0035
1,PMM1609
2,PMM1436
3,PMM1191
5,PMM0070


In [25]:
# export for XCorr >2.5
# read the CSVs of each replicate into a datadrame we name 'comet' using the pandas read_csv function
comet1 = pd.read_csv("/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/RAL4_95_MED2_trypsin/TPP/RAL4_MED2_trypsin_1_PTMopt_Comet.csv")
comet2 = pd.read_csv("/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/RAL4_95_MED2_trypsin/TPP/RAL4_MED2_trypsin_2_PTMopt_Comet.csv")

frames = [comet1, comet2]

# concatenate dataframes
cometall = pd.concat(frames, sort=False)

# get rid of these few weirdos
cometall = cometall[cometall.xcorr != '[unavailable]']

# let's only keep PSMs > XCorr 2.5 (see MED4 Comet notebook - this keeps an FDR < 1%)
# need to convert Xcorr column from strings to numeric so we can use loc
cometall['xcorr'] = pd.to_numeric(cometall['xcorr'])
comet25 = cometall.loc[cometall['xcorr'] >= 2.5]

# getting rid of any DECOY protein IDs
cometpmm25 = comet25[~comet25['protein'].str.contains("DECOY")]

# let's also make a dataframe that's just the proteins
cometprot = cometpmm25[['protein']].copy() 

# let's also deduplicate
cometprotdd25 = pd.DataFrame.drop_duplicates(cometprot)

print("# redundant Comet peptides", len(cometprot))
print("# nonredundant Comet peptides", len(cometprotdd))

# export as a .txt file without headers
cometprotdd25.to_csv("/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/RAL4_95_MED2_trypsin/TPP/RAL4_MED2_combine_Comet2.5Xcorr_proteins.txt", header=False, index=False)

#look at the dataframe
cometprotdd25.head()

# redundant Comet peptides 35928
# nonredundant Comet peptides 1294


Unnamed: 0,protein
0,PMM0035
1,PMM1609
2,PMM1436
3,PMM1191
5,PMM0070
