### Manipulation of Comet identified+fungi database search results for ETNP 2017 P2 sample LC-MS/MS data using Python

Starting with:

    Comet search results (.csv) of PTM-optimized database searches against ETNP-identified + fungal proteins
    These were all searched with 15 ppm precursor tolerance and 0.5 ppm fragment ion tolerance
    Search database included marine fungi and labyrinthulomycetes discovered using de novo peptide sequencing
        And unlike main searches, used only previously identified ETNP proteins (roughly 4,000)
    XInteract file includes precursor intensities and protein descriptions

Goal:

    Files with stripped (no PTMs) peptide lists
    Columns with #'s of each modification in every sequence
    Column with stripped peptide lengths (# amino acids)
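
A minimal sketch of the three goals above, run on made-up peptides rather than the real search results. The example sequences are hypothetical; the bracketed-mass notation and the `147.04` oxidized-methionine mass follow the conventions used later in this notebook.

```python
import re
import pandas as pd

# Hypothetical peptides in "flanking.SEQUENCE[mass].flanking" notation
df = pd.DataFrame({"peptide": [
    "K.AM[147.04]LC[160.03]GK.S",
    "R.IVVGGPYSSVSDAASSLDSSQK.S",
    "K.N[115.03]QK.A",
]})

# Drop the flanking residues and their dots (first two and last two characters)
core = df["peptide"].str[2:-2]

# Remove each bracketed mass on its own; the non-greedy ".*?" stops at the first "]"
df["stripped peptide"] = core.str.replace(r"\[.*?\]", "", regex=True)

# Count one modification; str.count takes a regex, so escape the "." in the mass
df["m-oxid"] = df["peptide"].str.count(re.escape("147.04"))

# Length of the stripped peptide in amino acids
df["stripped length"] = df["stripped peptide"].str.len()

print(df)
```

A greedy pattern such as `\[.*\]` would delete everything between the first `[` and the last `]`, so a peptide carrying two modifications would lose the residues between them.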
    
### To use for a different file:

#### 1. Change the input file name in *IN 4*
#### 2. Use 'find + replace' (Esc + F) to replace the running # (e.g., 233) for another
#### 3. Update the NAAF factor calculated in *IN 6* into *IN 7*
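
As an alternative to find + replace, the per-sample steps could be wrapped in a function and applied to every sample in a loop. This is only a sketch: `annotate_termini` and the `samples` dictionary are hypothetical names, and the one-row dataframes stand in for the real search results.

```python
import pandas as pd

def annotate_termini(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with flanking-residue columns added (hypothetical helper)."""
    out = df.copy()
    pep = out["peptide"].astype(str).str.strip()
    out["L terminus"] = pep.str[0]   # residue before the cleavage site
    out["R terminus"] = pep.str[-1]  # residue after the cleavage site
    return out

# Hypothetical stand-ins for the per-sample search results, keyed by running number
samples = {
    "231": pd.DataFrame({"peptide": ["R.AIQQQIENPLAQQILSGELVPGK.V"]}),
    "233": pd.DataFrame({"peptide": ["K.IVVGGPYSSVSDAASSLDSSQK.S"]}),
}

# Same processing for every sample, no renaming of variables needed
annotated = {sample_id: annotate_termini(df) for sample_id, df in samples.items()}

for sample_id, df in annotated.items():
    print(sample_id, df[["L terminus", "R terminus"]].iloc[0].tolist())
```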

We don't have technical duplicates here, sadly, unlike the MED4 Pro samples. I exported the Comet search results after running them through XInteract and saved them as `.xlsx` files (pandas really doesn't like reading the `.xls` versions of the XInteract output because there are so many characters in the `protein` column for these samples) into my ETNP 2017 git repo:

In [1]:
cd /home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/ETNP-SKQ17-fungi-searches/SKQ17-Comet-fungi-searches/

/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/ETNP-SKQ17-fungi-searches/SKQ17-Comet-fungi-searches


In [2]:
# LIBRARIES
import os
os.getcwd()
# import pandas library for working with tabular data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import kde
# import regular expression (regex) library
import re
#check pandas version
pd.__version__

'1.0.5'

In [3]:
ls

ETNP-SKQ17-231-100m-0.3-JA2_PTMopt_Comet15_ETNPfungidb_03FDR.csv
ETNP-SKQ17-231-100m-0.3-JA2_PTMopt_Comet15_ETNPfungidb_03FDR.xlsx
ETNP-SKQ17-231-100m-0.3-JA2_PTMopt_Comet15_ETNPfungidb.csv
ETNP-SKQ17-233-265m-0.3-JA4_PTMopt_Comet15_ETNPfungidb_09FDR.csv
ETNP-SKQ17-233-265m-0.3-JA4_PTMopt_Comet15_ETNPfungidb_09FDR.xlsx
ETNP-SKQ17-233-265m-0.3-JA4_PTMopt_Comet15_ETNPfungidb.csv
ETNP-SKQ17-243-965m-0.3-JA14_PTMopt_Comet15_ETNPfungidb_09FDR.csv
ETNP-SKQ17-243-965m-0.3-JA14_PTMopt_Comet15_ETNPfungidb_09FDR.xlsx
ETNP-SKQ17-243-965m-0.3-JA14_PTMopt_Comet15_ETNPfungidb.csv
ETNP-SKQ17-273-965m-trap_PTMopt_Comet15_ETNPfungidb_00FDR.csv
ETNP-SKQ17-273-965m-trap_PTMopt_Comet15_ETNPfungidb.csv
ETNP-SKQ17-278-265m-trap_PTMopt_Comet15_ETNPfungidb_00FDR.csv
ETNP-SKQ17-278-265m-trap_PTMopt_Comet15_ETNPfungidb.csv
ETNP-SKQ17-378-100m-trap_PTMopt_Comet15_ETNPfungidb_00FDR.csv
ETNP-SKQ17-378-100m-trap_PTMopt_Comet15_ETNPfungidb_00FDR.xlsx
ETNP-SKQ17-378-100m-trap_PTMopt_Comet15_ETNPfungidb

## 231: 100 m McLane pump filtered on 0.3 um GF-75

In [4]:
# read the Excel file (.xlsx) into a dataframe using the pandas read_excel function
#cometdup231 = pd.read_csv("/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/ETNP-SKQ17/TPP-PTMopt/ETNP-SKQ17-TPP-PTMopt-hyroxylation/JA2_PTMopt_interact_quant_nopro.pep.csv", index_col='spectrum')

f_cometdup231 = pd.read_excel("/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/ETNP-SKQ17-fungi-searches/SKQ17-Comet-fungi-searches/interact-20170410_JA2_01.xlsx")

# remove redundant rows (rows identical in every column)
f_comet231 = f_cometdup231.drop_duplicates()

print("# redundant Comet peptides in combined dataframe", len(f_cometdup231))
print("# nonredundant Comet peptides in combined dataframe", len(f_comet231))

#look at the dataframe
f_comet231.head()

# redundant Comet peptides in combined dataframe 19193
# nonredundant Comet peptides in combined dataframe 19193


Unnamed: 0,spectrum,xcorr,deltacn,expect,ions,peptide,protein,calc_neutral_pep_mass,protein_descr
0,20170410_JA2_01.34842.34842.3,5.552,0.678,7.37e-12,31/88,R.AIQQQIENPLAQQILSGELVPGK.V,gi|54036848|sp|P63284.1|CLPB_ECOLI,2473.354,RecName: Full=Chaperone protein ClpB; AltName:...
1,20170410_JA2_01.23179.23179.2,4.989,0.72,1.01e-15,29/42,K.IVVGGPYSSVSDAASSLDSSQK.S,"ETNP_90m_PROKKA_168214,ETNP_100m_PROKKA_52245,...",2153.0488,Photosystem II 12 kDa extrinsic protein
2,20170410_JA2_01.34897.34897.2,4.888,0.709,2.44e-13,25/44,R.AIQQQIENPLAQQILSGELVPGK.V,gi|54036848|sp|P63284.1|CLPB_ECOLI,2473.354,RecName: Full=Chaperone protein ClpB; AltName:...
3,20170410_JA2_01.23135.23135.3,4.853,0.638,2.34e-16,29/84,K.IVVGGPYSSVSDAASSLDSSQK.S,"ETNP_90m_PROKKA_168214,ETNP_100m_PROKKA_52245,...",2153.0488,Photosystem II 12 kDa extrinsic protein
4,20170410_JA2_01.28172.28172.2,4.826,0.693,5.01e-13,29/42,K.IVVGGPYSSVSDAASVLDGSQK.S,"ETNP_100m_PROKKA_02317,ETNP_100m_PROKKA_02317,...",2135.0746,Photosystem II 12 kDa extrinsic protein precursor
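
The two counts above are identical, so no rows were dropped. One possible reason (an assumption, not verified against the real file) is that the exported `Unnamed: 0` row index makes every row unique when all columns are compared. The sketch below uses a hypothetical three-row table to show the difference between comparing every column and comparing only the columns that define a PSM.

```python
import pandas as pd

# Hypothetical miniature of the search results; "Unnamed: 0" mimics an exported row index
psms = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "spectrum": ["run.100.100.2", "run.100.100.2", "run.200.200.3"],
    "peptide": ["K.AAK.S", "K.AAK.S", "R.GGR.V"],
})

# Comparing every column: the unique row index makes all rows distinct
all_cols = psms.drop_duplicates()

# Comparing only the columns chosen to define a PSM (an assumed definition)
by_psm = psms.drop_duplicates(subset=["spectrum", "peptide"])

print(len(all_cols), len(by_psm))
```

Which columns belong in `subset` depends on what counts as redundant for the analysis; `spectrum` plus `peptide` is just one choice.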


In [17]:
# get rid of rows where the xcorr is unavailable (usually 3 or so)
# .copy() avoids a SettingWithCopyWarning when columns are added below
f_comet231 = f_comet231[f_comet231.xcorr != '[unavailable]'].copy()

# index the string with str[0] to add a column with the peptide's left terminus
f_comet231['L terminus'] = f_comet231['peptide'].astype(str).str[0]

# strip whitespace, then index with str[-1] to add a column with the peptide's right terminus
f_comet231['R terminus'] = f_comet231['peptide'].str.strip().str[-1]

# use a count function to enumerate the # of A's (alanines) in each peptide
#f_comet231['A'] = f_comet231['peptide'].str.count("A")

# use a count function to enumerate the # of C's (cysteines) in each peptide
#f_comet231['C'] = f_comet231['peptide'].str.count("C")

# use a count function to enumerate the # of D's (aspartic acids) in each peptide
#f_comet231['D'] = f_comet231['peptide'].str.count("D")

# use a count function to enumerate the # of E's (glutamic acids) in each peptide
#f_comet231['E'] = f_comet231['peptide'].str.count("E")

# use a count function to enumerate the # of F's (phenylalanines) in each peptide
#f_comet231['F'] = f_comet231['peptide'].str.count("F")

# use a count function to enumerate the # of G's (glycines) in each peptide
#f_comet231['G'] = f_comet231['peptide'].str.count("G")

# use a count function to enumerate the # of H's (histidines) in each peptide
#f_comet231['H'] = f_comet231['peptide'].str.count("H")

# use a count function to enumerate the # of I's (isoleucines) in each peptide
# in f_comet231 output, there will be no isoleucines (they're lumped in with leucines)
#f_comet231['I'] = f_comet231['peptide'].str.count("I")

# use a count function to enumerate the # of K's (lysines) in each peptide
#f_comet231['K'] = f_comet231['peptide'].str.count("K")

# use a count function to enumerate the # of L's (leucines) in each peptide
# also these include the isoleucines
#f_comet231['L'] = f_comet231['peptide'].str.count("L")

# use a count function to enumerate the # of M's (methionines) in each peptide
#f_comet231['M'] = f_comet231['peptide'].str.count("M")

# use a count function to enumerate the # of N's (asparagines) in each peptide
#f_comet231['N'] = f_comet231['peptide'].str.count("N")

# use a count function to enumerate the # of P's (prolines) in each peptide
#f_comet231['P'] = f_comet231['peptide'].str.count("P")

# use a count function to enumerate the # of Q's (glutamines) in each peptide
#f_comet231['Q'] = f_comet231['peptide'].str.count("Q")

# use a count function to enumerate the # of R's (arginines) in each peptide
#f_comet231['R'] = f_comet231['peptide'].str.count("R")

# use a count function to enumerate the # of S's (serines) in each peptide
#f_comet231['S'] = f_comet231['peptide'].str.count("S")

# use a count function to enumerate the # of T's (threonines) in each peptide
#f_comet231['T'] = f_comet231['peptide'].str.count("T")

# use a count function to enumerate the # of V's (valines) in each peptide
f_comet231['V'] = f_comet231['peptide'].str.count("V")

# use a count function to enumerate the # of W's (tryptophans) in each peptide
#f_comet231['W'] = f_comet231['peptide'].str.count("W")

# use a count function to enumerate the # of Y's (tyrosines) in each peptide
#f_comet231['Y'] = f_comet231['peptide'].str.count("Y")

# use a count function to enumerate the # of carbamidomethylated C's in each peptide
#f_comet231['c-carb'] = f_comet231['peptide'].str.count("160.03")

# use a count function to enumerate the # of oxidized M's in each peptide
#f_comet231['m-oxid'] = f_comet231['peptide'].str.count("147.04")

# use a count function to enumerate the # of deamidated N's in each peptide
#f_comet231['n-deam'] = f_comet231['peptide'].str.count("115.03")

# use a count function to enumerate the # of deamidated Q's in each peptide
#f_comet231['q-deam'] = f_comet231['peptide'].str.count("129.04")

# use a count function to enumerate the # of hydroxylated K's in each peptide
#f_comet231['k-hydr'] = f_comet231['peptide'].str.count("144.09")

# use a count function to enumerate the # of hydroxylated P's in each peptide
# I removed P hydroxylation in final searches because there were so few
#f_comet231['p-hydr'] = f_comet231['peptide'].str.count("131.05")

# use a count function to enumerate the # of methylated R's in each peptide
#f_comet231['r-meth'] = f_comet231['peptide'].str.count("170.12")

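# create a column with 'stripped' peptide sequences: drop the flanking residues,
# then remove each bracketed modification mass (non-greedy, so residues between two PTMs are kept)
f_comet231['stripped peptide'] = f_comet231['peptide'].str[2:-2].str.replace(r"\[.*?\]", "", regex=True)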

# add a column with the stripped peptide length (number of AAs)
#f_comet231['stripped length'] = f_comet231['stripped peptide'].apply(len)

#f_comet231['NAAF num.'] = f_comet231['precursor_intensity'] / f_comet231['stripped length']

# total the number of modifications in sequence
#f_comet231['ptm-total'] = f_comet231['c-carb'] + f_comet231['m-oxid'] + f_comet231['n-deam'] + f_comet231['q-deam'] + f_comet231['k-hydr'] + f_comet231['r-meth']

# turn all isoleucines into leucines
# this helps later in comparing Unipept peptides to PeaksDB and Comet ones
f_comet231['stripped I-L']= f_comet231['stripped peptide'].str.replace('I','L')

# write modified dataframe to a new CSV file
f_comet231.to_csv("/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/ETNP-SKQ17-fungi-searches/SKQ17-Comet-fungi-searches/ETNP-SKQ17-231-100m-0.3-JA2_PTMopt_Comet15_ETNPfungidb.csv")

# check out the results
f_comet231.head()



Unnamed: 0,spectrum,xcorr,deltacn,expect,ions,peptide,protein,calc_neutral_pep_mass,protein_descr,L terminus,R terminus,V,stripped peptide,stripped I-L
0,20170410_JA2_01.34842.34842.3,5.552,0.678,7.37e-12,31/88,R.AIQQQIENPLAQQILSGELVPGK.V,gi|54036848|sp|P63284.1|CLPB_ECOLI,2473.354,RecName: Full=Chaperone protein ClpB; AltName:...,R,V,2,AIQQQIENPLAQQILSGELVPGK,ALQQQLENPLAQQLLSGELVPGK
1,20170410_JA2_01.23179.23179.2,4.989,0.72,1.01e-15,29/42,K.IVVGGPYSSVSDAASSLDSSQK.S,"ETNP_90m_PROKKA_168214,ETNP_100m_PROKKA_52245,...",2153.0488,Photosystem II 12 kDa extrinsic protein,K,S,3,IVVGGPYSSVSDAASSLDSSQK,LVVGGPYSSVSDAASSLDSSQK
2,20170410_JA2_01.34897.34897.2,4.888,0.709,2.44e-13,25/44,R.AIQQQIENPLAQQILSGELVPGK.V,gi|54036848|sp|P63284.1|CLPB_ECOLI,2473.354,RecName: Full=Chaperone protein ClpB; AltName:...,R,V,2,AIQQQIENPLAQQILSGELVPGK,ALQQQLENPLAQQLLSGELVPGK
3,20170410_JA2_01.23135.23135.3,4.853,0.638,2.34e-16,29/84,K.IVVGGPYSSVSDAASSLDSSQK.S,"ETNP_90m_PROKKA_168214,ETNP_100m_PROKKA_52245,...",2153.0488,Photosystem II 12 kDa extrinsic protein,K,S,3,IVVGGPYSSVSDAASSLDSSQK,LVVGGPYSSVSDAASSLDSSQK
4,20170410_JA2_01.28172.28172.2,4.826,0.693,5.01e-13,29/42,K.IVVGGPYSSVSDAASVLDGSQK.S,"ETNP_100m_PROKKA_02317,ETNP_100m_PROKKA_02317,...",2135.0746,Photosystem II 12 kDa extrinsic protein precursor,K,S,4,IVVGGPYSSVSDAASVLDGSQK,LVVGGPYSSVSDAASVLDGSQK
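
The commented-out `NAAF num.` line in the cell above divides precursor intensity by stripped peptide length. A small sketch of that calculation follows, with hypothetical intensities. Dividing by the sample-wide sum in the last step is an assumption about how the NAAF factor is applied, not something this notebook confirms.

```python
import pandas as pd

# Hypothetical values; real intensities would come from the XInteract export
df = pd.DataFrame({
    "stripped peptide": ["AMLCGK", "NQK"],
    "precursor_intensity": [6000.0, 3000.0],
})

# Peptide length in amino acids
df["stripped length"] = df["stripped peptide"].str.len()

# NAAF numerator: precursor intensity per residue
df["NAAF num."] = df["precursor_intensity"] / df["stripped length"]

# Assumed normalization: divide by the sum over the sample so the values add up to 1
naaf_factor = df["NAAF num."].sum()
df["NAAF"] = df["NAAF num."] / naaf_factor

print(naaf_factor)
print(df[["stripped peptide", "NAAF num.", "NAAF"]])
```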


## Calculating the false discovery rate (% FDR)

### Filtering PSMs above a selected XCorr value and exporting peptides

In [6]:
# Let's separate out the decoy hits from the good ones

f_comet231pmm = f_comet231[~f_comet231['protein'].str.contains("DECOY")]
f_comet231dec = f_comet231[f_comet231['protein'].str.contains("DECOY")]

# how many PSMs match only real (non-decoy) proteins in the database?

print("# real Comet PSMs", len(f_comet231pmm))

# compared to how many PSMs containing decoys?

print("# decoy Comet PSMs", len(f_comet231dec))

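# calculate the bulk FDR over all PSMs, before any score filtering (so a high value is expected)
# note: this cell uses decoys / targets; the filtered calculation further down uses decoys / (decoys + targets)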

r = len(f_comet231pmm)
d = len(f_comet231dec)

FDR = d/r*100

print("False discovery rate = ", FDR)

# real Comet PSMs 11287
# decoy Comet PSMs 7906
False discovery rate =  70.04518472579073
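
This notebook computes the decoy-based FDR in two ways: decoys divided by targets, and decoys divided by all PSMs. The sketch below puts both in one hypothetical helper (`decoy_fdr` is not part of the original notebook) so the same definition can be used throughout. The counts are the ones printed above.

```python
def decoy_fdr(n_target: int, n_decoy: int, include_decoys_in_denominator: bool = False) -> float:
    """Percent FDR estimated from target and decoy PSM counts (hypothetical helper)."""
    denominator = n_target + n_decoy if include_decoys_in_denominator else n_target
    if denominator == 0:
        # No PSMs to evaluate; report 0 rather than dividing by zero
        return 0.0
    return 100.0 * n_decoy / denominator

# Counts printed by the cell above
n_target, n_decoy = 11287, 7906

ratio_fdr = decoy_fdr(n_target, n_decoy)
fraction_fdr = decoy_fdr(n_target, n_decoy, include_decoys_in_denominator=True)

print("decoys / targets            =", ratio_fdr)
print("decoys / (decoys + targets) =", fraction_fdr)
```

The two definitions diverge sharply when decoys are common, as in the unfiltered data, and nearly agree once the decoy count is small.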


In [16]:
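# keep only PSMs with XCorr >= 1.99 (the "_3" suffix in the variable names does not reflect the cutoff used here)
# need to convert the xcorr column from strings to numeric so it can be compared against the cutoff
# note that pmm here just means 'non decoy'; it's a vestige of the accession prefixes from the MED proteins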
f_comet231['xcorr'] = pd.to_numeric(f_comet231['xcorr'])

f_comet231_3 = f_comet231.loc[f_comet231['xcorr'] >= 1.99]

# What's the FDR?

# Let's separate out the decoy hits from the good ones

f_comet231pmm3 = f_comet231_3[~f_comet231_3['protein'].str.contains("DECOY")]
f_comet231dec3 = f_comet231_3[f_comet231_3['protein'].str.contains("DECOY")]

# how many PSMs match only real (non-decoy) proteins in the database?

print("# real Comet PSMs", len(f_comet231pmm3))

# compared to how many PSMs containing decoys?

print("# decoy Comet PSMs", len(f_comet231dec3))

# calculate the FDR 

r = len(f_comet231pmm3)
d = len(f_comet231dec3)

FDR = d/(d+r)*100

print("False discovery rate = ", FDR)

# write the filtered non-decoy PSMs to a new CSV file

f_comet231pmm3.to_csv("/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/ETNP-SKQ17-fungi-searches/SKQ17-Comet-fungi-searches/ETNP-SKQ17-231-100m-0.3-JA2_PTMopt_Comet15_ETNPfungidb_09FDR.csv")

# real Comet PSMs 865
# decoy Comet PSMs 8
False discovery rate =  0.9163802978235969
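
The XCorr cutoff above was chosen by hand to land under 1% FDR. A hedged sketch of automating that search is below: it scans the observed XCorr values and returns the smallest cutoff whose retained PSMs meet a target FDR, using the decoys / (decoys + targets) definition from the cell above. The function name and the toy data are hypothetical.

```python
import pandas as pd

def min_xcorr_for_fdr(df: pd.DataFrame, max_fdr_pct: float, decoy_tag: str = "DECOY"):
    """Smallest observed XCorr cutoff with decoy/(decoy+target) <= max_fdr_pct, or None."""
    # Convert scores to numbers; entries such as '[unavailable]' become NaN and are dropped
    scored = df.assign(xcorr=pd.to_numeric(df["xcorr"], errors="coerce")).dropna(subset=["xcorr"])
    is_decoy = scored["protein"].str.contains(decoy_tag, regex=False)
    for cutoff in sorted(scored["xcorr"].unique()):
        keep = scored["xcorr"] >= cutoff
        n_decoy = int((keep & is_decoy).sum())
        n_target = int((keep & ~is_decoy).sum())
        # At least one PSM is always kept, because the cutoff is itself an observed score
        if 100.0 * n_decoy / (n_decoy + n_target) <= max_fdr_pct:
            return cutoff
    return None

# Hypothetical toy data, including one unavailable score
psms = pd.DataFrame({
    "xcorr": ["5.0", "4.0", "3.0", "2.0", "1.0", "[unavailable]"],
    "protein": ["P1", "P2", "DECOY_P3", "P4", "DECOY_P5", "P6"],
})

print(min_xcorr_for_fdr(psms, max_fdr_pct=30.0))
print(min_xcorr_for_fdr(psms, max_fdr_pct=1.0))
```

Because the FDR is not guaranteed to fall monotonically as the cutoff rises, the first cutoff that passes may be followed by a higher one that fails; checking a few neighboring cutoffs before settling on a value is prudent.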
