## Manipulation of PeaskDB de novo-assisted database search results of Trocas 7 (April, 2019 high water) lower Amazon river proteomics LC-MS/MS data using python.

Starting with:

PeaksDB search results (.csv) of database searches against Henrique's Amazon metagenome (+Hi3)
All samples (duplicates of most) included, so `Area` and `Spectral Counts` columns for each injection
These were all searched with 15 ppm precursor tolerance and 0.5 ppm fragement ion tolerance
Exported at <1.0% FDR

Goal:

Files with stripped (no PTMs) peptide lists and
Columns with #'s of each modification in every sequence
Column with stripped peptide lengths (# amino acids)


In [1]:
cd /home/millieginty/Documents/git-repos/amazon/data/TROCAS7_Fusion_Apr2021_PEAKS_76-all-samples/

/home/millieginty/Documents/git-repos/amazon/data/TROCAS7_Fusion_Apr2021_PEAKS_76-all-samples


In [2]:
ls

Apr21-peaks76-DB-peptide.csv           Apr21-peaks76-DB-proteins.fasta
Apr21-peaks76-DB-protein-peptides.csv  Apr21-peaks76-DB-search-psm.csv
Apr21-peaks76-DB-proteins.csv          Apr21-peaks76-dno.csv


In [3]:
# LIBRARIES
#import pandas library for working with tabular data
import os
os.getcwd()
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import kde
#import regular expresson (regex)
import re
#check pandas version
pd.__version__

'1.0.5'

In [4]:
# read the CSV into a dataframe using the pandas read_csv function
pdb_dup = pd.read_csv("/home/millieginty/Documents/git-repos/amazon/data/TROCAS7_Fusion_Apr2021_PEAKS_76-all-samples/Apr21-peaks76-DB-peptide.csv")

# remove redundant rows
pdb = pd.DataFrame.drop_duplicates(pdb_dup)

print(pdb.columns)

#remmove spec and accession columns because they mess parsing up

del pdb['Accession']
del pdb['PTM']
del pdb['AScore']

# get rid of all the spectral count #s, we're fine with Area
pdb = pdb[pdb.columns.drop(list(pdb.filter(regex='Spec')))]

mean_length = pdb['Length'].mean()
print('mean peptide length:', mean_length)

print("# redundant peaksdb peptides in combined dataframe", len(pdb_dup))
print("# nonredundant peaksdb peptides in combined dataframe", len(pdb))

#look at the dataframe
pdb.head()

Index(['Peptide', '-10lgP', 'Mass', 'Length', 'ppm', 'm/z', 'RT',
       'Area Trocas7-302-Bay', 'Area Trocas7-306-Chav',
       'Area Trocas7-310-SMCP', 'Area Trocas7-318-NMCP',
       'Area Trocas7-402-Bay', 'Area Trocas7-406-Chav',
       'Area Trocas7-410-SMCP', 'Area Trocas7-417-NMCP',
       'Area Trocas7-102-Bay', 'Area Trocas7-106-Chav',
       'Area Trocas7-206-Chav', 'Area Trocas7-110-SMCP',
       'Area Trocas7-126-NMCP', 'Area Trocas7-202-Bay',
       'Area Trocas7-210-SMCP', 'Area Trocas7-410-SMCP-DUP',
       'Area Trocas7-226-NMCP', 'Area Trocas7-303-Bay',
       'Area Trocas7-310-SMCP-DUP', 'Area Trocas7-102-Bay-DUP',
       'Area Trocas7-106-Chav-DUP', 'Area Trocas7-302-Bay-DUP',
       'Area Trocas7-306-Chav-DUP', 'Area Trocas7-503-Bay',
       'Area Trocas7-519-NMCP', 'Area Trocas7-318-NMCP-DUP',
       'Area Trocas7-402-Bay-DUP', 'Area Trocas7-406-Chav-DUP',
       'Area Trocas7-417-NMCP-DUP', 'Area Trocas7-307-Chav',
       'Area Trocas7-311-SMCP', 'Area Trocas7-31

Unnamed: 0,Peptide,-10lgP,Mass,Length,ppm,m/z,RT,Area Trocas7-302-Bay,Area Trocas7-306-Chav,Area Trocas7-310-SMCP,...,Area Trocas7-406-Chav-DUP,Area Trocas7-417-NMCP-DUP,Area Trocas7-307-Chav,Area Trocas7-311-SMCP,Area Trocas7-319-NMCP,Area Trocas7-507-Chav,Area Trocas7-511-SMCP,Fraction,Scan,Source File
0,LGEHNIDVLEGNEQFINAAK,112.16,2210.0967,20,-2.8,1106.0525,95.57,4370000.0,5920000.0,247000.0,...,90200.0,,18200.0,2030000.0,155000.0,283000.0,1050000.0,112,16192,20210411_Trocas7_668_SMCP311_DDA_120min_1.raw
1,SC(+57.02)AAAGTEC(+57.02)LISGWGNTK,104.26,1881.835,18,2.1,941.9268,93.16,3330000.0,3860000.0,1730000.0,...,1470000.0,570000.0,3370000.0,3140000.0,6150000.0,608000.0,534000.0,111,16381,20210411_Trocas7_667_Chav307_DDA_120min_1.raw
2,SSGSSYPSLLQC(+57.02)LK,88.49,1525.7446,14,2.0,763.8811,101.67,371000.0,1150000.0,367000.0,...,208000.0,,253000.0,178000.0,570000.0,,135000.0,111,18330,20210411_Trocas7_667_Chav307_DDA_120min_1.raw
3,SGGGGGGGLGSGGSIR,85.72,1231.5905,16,1.9,616.8036,39.3,,,,...,,,204000.0,114000.0,270000.0,,,113,6581,20210411_Trocas7_669_NMCP319_DDA_120min_1.raw
4,RHPYFYAPELLFFAKR,83.71,2054.0889,16,2.1,514.5306,131.89,,,,...,,,,,,4900.0,,81,25155,20210411_Trocas7_666_Bay303_DDA_120min_1.raw


In [5]:
# use a count function to enumerate the # of A's (alanines) in each peptide
pdb['A'] = pdb['Peptide'].str.count("A")

# use a count function to enumerate the # of C's (cysteines) in each peptide
pdb['C'] = pdb['Peptide'].str.count("C")

# use a count function to enumerate the # of D's (aspartic acids) in each peptide
pdb['D'] = pdb['Peptide'].str.count("D")

# use a count function to enumerate the # of E's (glutamic acids) in each peptide
pdb['E'] = pdb['Peptide'].str.count("E")

# use a count function to enumerate the # of F's (phenylalanines) in each peptide
pdb['F'] = pdb['Peptide'].str.count("F")

# use a count function to enumerate the # of G's (glycines) in each peptide
pdb['G'] = pdb['Peptide'].str.count("G")

# use a count function to enumerate the # of H's (histidines) in each peptide
pdb['H'] = pdb['Peptide'].str.count("H")

# use a count function to enumerate the # of I's (isoleucines) in each peptide
# in pdb output, there will be no isoleucines (they're lumped in with leucines)
pdb['I'] = pdb['Peptide'].str.count("I")

# use a count function to enumerate the # of K's (lysines) in each peptide
pdb['K'] = pdb['Peptide'].str.count("K")

# use a count function to enumerate the # of L's (leucines) in each peptide
# also these include the isoleucines
pdb['L'] = pdb['Peptide'].str.count("L")

# use a count function to enumerate the # of M's (methionines) in each peptide
pdb['M'] = pdb['Peptide'].str.count("M")

# use a count function to enumerate the # of N's (asparagines) in each peptide
pdb['N'] = pdb['Peptide'].str.count("N")

# use a count function to enumerate the # of P's ([prolines]) in each peptide
pdb['P'] = pdb['Peptide'].str.count("P")

# use a count function to enumerate the # of Q's (glutamines) in each peptide
pdb['Q'] = pdb['Peptide'].str.count("Q")

# use a count function to enumerate the # of R's (arginines) in each peptide
pdb['R'] = pdb['Peptide'].str.count("R")

# use a count function to enumerate the # of S's (serines) in each peptide
pdb['S'] = pdb['Peptide'].str.count("S")

# use a count function to enumerate the # of T's (threonines) in each peptide
pdb['T'] = pdb['Peptide'].str.count("T")

# use a count function to enumerate the # of V's (valines) in each peptide
pdb['V'] = pdb['Peptide'].str.count("V")

# use a count function to enumerate the # of W's (tryptophans) in each peptide
pdb['W'] = pdb['Peptide'].str.count("W")

# use a count function to enumerate the # of Y's (tyrosines) in each peptide
pdb['Y'] = pdb['Peptide'].str.count("Y")

# use a count function to enumerate the # of carbamidomethylated C's in each peptide
pdb['c-carb'] = pdb['Peptide'].str.count("57.02")

# use a count function to enumerate the # of oxidized M's in each peptide
pdb['m-oxid'] = pdb['Peptide'].apply(lambda x: x.count('M(+15.99)'))

# use a count function to enumerate the # of oxidized K's in each peptide
#pdb['k-oxid'] = pdb['Peptide'].apply(lambda x: x.count('K(+15.99)'))

# use a count function to enumerate the # of oxidized P's in each peptide
#pdb['p-oxid'] = pdb['Peptide'].apply(lambda x: x.count('P(+15.99)'))

# use a count function to enumerate the # of oxidized R's in each peptide
#pdb['r-oxid'] = pdb['Peptide'].apply(lambda x: x.count('R(+15.99)'))

# use a count function to enumerate the # of oxidized Y's in each peptide
#pdb['y-oxid'] = pdb['Peptide'].apply(lambda x: x.count('Y(+15.99)'))

# use a lamba function to enumerate the # of deamidated N's in each peptide
pdb['n-deam'] = pdb['Peptide'].apply(lambda x: x.count('N(+.98)'))

# use a lamba function to enumerate the # of deamidated N's in each peptide
pdb['q-deam'] = pdb['Peptide'].apply(lambda x: x.count('Q(+.98)'))

# use a count function to enumerate the # of methylated K's in each peptide
#pdb['k-meth'] = pdb['Peptide'].apply(lambda x: x.count('K(+14.02)'))

# use a count function to enumerate the # of methylated R's in each peptide
#pdb['r-meth'] = pdb['Peptide'].apply(lambda x: x.count('R(+14.02)'))

# use a count function to enumerate the # of pyro glu Q's in each peptide
#pdb['q-pyro'] = pdb['Peptide'].apply(lambda x: x.count('Q(-17.03)'))

# use a count function to enumerate the # of acetylation of K's in each peptide
#pdb['k-acet'] = pdb['Peptide'].apply(lambda x: x.count('K(+42.01)'))

# create a column with 'stripped' peptide sequences using strip
pdb['stripped_peptide'] = pdb['Peptide'].str.replace(r"\(.*\)","")

# add a column with the stripped peptide length (number of AAs)
pdb['stripped_length'] = pdb['stripped_peptide'].apply(len)

##pdb['NAAF_num.'] = pdb['Area'] / pdb['stripped_length']

# total the number of modifications in sequence
pdb['ptm-total'] = pdb['c-carb'] + pdb['m-oxid'] + pdb['n-deam'] + pdb['q-deam'] 

# turn all isoleucines into leucines
# this helps later in comparing Unipept peptides to PeaksDB and Comet ones
pdb['stripped_IL']= pdb['stripped_peptide'].str.replace('I','L')

# write modified dataframe to new txt file
pdb.to_csv("/home/millieginty/Documents/git-repos/amazon/data/processed/TROCAS7_Fusion_Apr2021-all-samples/Apr21-peaks76-DB-peptide-proc.csv")

# check out the results
pdb.head()

Unnamed: 0,Peptide,-10lgP,Mass,Length,ppm,m/z,RT,Area Trocas7-302-Bay,Area Trocas7-306-Chav,Area Trocas7-310-SMCP,...,W,Y,c-carb,m-oxid,n-deam,q-deam,stripped_peptide,stripped_length,ptm-total,stripped_IL
0,LGEHNIDVLEGNEQFINAAK,112.16,2210.0967,20,-2.8,1106.0525,95.57,4370000.0,5920000.0,247000.0,...,0,0,0,0,0,0,LGEHNIDVLEGNEQFINAAK,20,0,LGEHNLDVLEGNEQFLNAAK
1,SC(+57.02)AAAGTEC(+57.02)LISGWGNTK,104.26,1881.835,18,2.1,941.9268,93.16,3330000.0,3860000.0,1730000.0,...,1,0,2,0,0,0,SCLISGWGNTK,11,2,SCLLSGWGNTK
2,SSGSSYPSLLQC(+57.02)LK,88.49,1525.7446,14,2.0,763.8811,101.67,371000.0,1150000.0,367000.0,...,0,1,1,0,0,0,SSGSSYPSLLQCLK,14,1,SSGSSYPSLLQCLK
3,SGGGGGGGLGSGGSIR,85.72,1231.5905,16,1.9,616.8036,39.3,,,,...,0,0,0,0,0,0,SGGGGGGGLGSGGSIR,16,0,SGGGGGGGLGSGGSLR
4,RHPYFYAPELLFFAKR,83.71,2054.0889,16,2.1,514.5306,131.89,,,,...,0,2,0,0,0,0,RHPYFYAPELLFFAKR,16,0,RHPYFYAPELLFFAKR


In [31]:
# Used Libre Calc to separate out the T0 and T24 samples
# But still want to delete peptides that are all NaN in all T0 or T24 samples
# read the CSV into a dataframe using the pandas read_csv function

time0db = pd.read_csv("/home/millieginty/Documents/git-repos/amazon/data/processed/TROCAS7_Fusion_Apr2021-all-samples/Apr21-peaks76-DB-peptide-time0.csv", index_col=0)
time24db = pd.read_csv("/home/millieginty/Documents/git-repos/amazon/data/processed/TROCAS7_Fusion_Apr2021-all-samples/Apr21-peaks76-DB-peptide-time24.csv", index_col=0)

print('rows in undropped T0 df:', len(time0db))
print('rows in undropped T24 df:', len(time24db))

time0db_clean = time0db.dropna(how='all')
time24db_clean = time24db.dropna(how='all')

print('rows in undropped T0 df:', len(time0db_clean))
print('rows in undropped T24 df:', len(time24db_clean))

time0db_clean['peptide'] = time0db_clean.index
time24db_clean['peptide'] = time24db_clean.index

time0db_clean.head()

rows in undropped T0 df: 902
rows in undropped T24 df: 902
rows in undropped T0 df: 298
rows in undropped T24 df: 771


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  time0db_clean['peptide'] = time0db_clean.index
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  time24db_clean['peptide'] = time24db_clean.index


Unnamed: 0_level_0,Area Trocas7-102-Bay,Area Trocas7-106-Chav,Area Trocas7-206-Chav,Area Trocas7-110-SMCP,Area Trocas7-126-NMCP,Area Trocas7-202-Bay,Area Trocas7-210-SMCP,Area Trocas7-226-NMCP,Area Trocas7-102-Bay-DUP,Area Trocas7-106-Chav-DUP,peptide
Peptide,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
LGEHNIDVLEGNEQFINAAK,21900000.0,14100000.0,24800000.0,3990000.0,3670000.0,12900000.0,2850000.0,3650000.0,13300000.0,7910000.0,LGEHNIDVLEGNEQFINAAK
SC(+57.02)AAAGTEC(+57.02)LISGWGNTK,3960000.0,4480000.0,2320000.0,881000.0,1950000.0,1120000.0,2010000.0,1380000.0,5310000.0,5150000.0,SC(+57.02)AAAGTEC(+57.02)LISGWGNTK
SSGSSYPSLLQC(+57.02)LK,637000.0,900000.0,627000.0,182000.0,332000.0,241000.0,517000.0,300000.0,837000.0,1040000.0,SSGSSYPSLLQC(+57.02)LK
SGGGGGGGLGSGGSIR,,36900.0,,,,,,,,,SGGGGGGGLGSGGSIR
LGEHNIDVLEGNEQFIN(+.98)AAK,337000.0,216000.0,259000.0,,44800.0,96800.0,,,172000.0,101000.0,LGEHNIDVLEGNEQFIN(+.98)AAK


In [29]:
# Now, clean up the T0 and T24 peptides and get their stripped/AA

# use a count function to enumerate the # of A's (alanines) in each peptide
time0db_clean['A'] = time0db_clean['peptide'].str.count("A")

# use a count function to enumerate the # of C's (cysteines) in each peptide
time0db_clean['C'] = time0db_clean['peptide'].str.count("C")

# use a count function to enumerate the # of D's (aspartic acids) in each peptide
time0db_clean['D'] = time0db_clean['peptide'].str.count("D")

# use a count function to enumerate the # of E's (glutamic acids) in each peptide
time0db_clean['E'] = time0db_clean['peptide'].str.count("E")

# use a count function to enumerate the # of F's (phenylalanines) in each peptide
time0db_clean['F'] = time0db_clean['peptide'].str.count("F")

# use a count function to enumerate the # of G's (glycines) in each peptide
time0db_clean['G'] = time0db_clean['peptide'].str.count("G")

# use a count function to enumerate the # of H's (histidines) in each peptide
time0db_clean['H'] = time0db_clean['peptide'].str.count("H")

# use a count function to enumerate the # of I's (isoleucines) in each peptide
# in time0db_clean output, there will be no isoleucines (they're lumped in with leucines)
time0db_clean['I'] = time0db_clean['peptide'].str.count("I")

# use a count function to enumerate the # of K's (lysines) in each peptide
time0db_clean['K'] = time0db_clean['peptide'].str.count("K")

# use a count function to enumerate the # of L's (leucines) in each peptide
# also these include the isoleucines
time0db_clean['L'] = time0db_clean['peptide'].str.count("L")

# use a count function to enumerate the # of M's (methionines) in each peptide
time0db_clean['M'] = time0db_clean['peptide'].str.count("M")

# use a count function to enumerate the # of N's (asparagines) in each peptide
time0db_clean['N'] = time0db_clean['peptide'].str.count("N")

# use a count function to enumerate the # of P's ([prolines]) in each peptide
time0db_clean['P'] = time0db_clean['peptide'].str.count("P")

# use a count function to enumerate the # of Q's (glutamines) in each peptide
time0db_clean['Q'] = time0db_clean['peptide'].str.count("Q")

# use a count function to enumerate the # of R's (arginines) in each peptide
time0db_clean['R'] = time0db_clean['peptide'].str.count("R")

# use a count function to enumerate the # of S's (serines) in each peptide
time0db_clean['S'] = time0db_clean['peptide'].str.count("S")

# use a count function to enumerate the # of T's (threonines) in each peptide
time0db_clean['T'] = time0db_clean['peptide'].str.count("T")

# use a count function to enumerate the # of V's (valines) in each peptide
time0db_clean['V'] = time0db_clean['peptide'].str.count("V")

# use a count function to enumerate the # of W's (tryptophans) in each peptide
time0db_clean['W'] = time0db_clean['peptide'].str.count("W")

# use a count function to enumerate the # of Y's (tyrosines) in each peptide
time0db_clean['Y'] = time0db_clean['peptide'].str.count("Y")

# use a count function to enumerate the # of carbamidomethylated C's in each peptide
time0db_clean['c-carb'] = time0db_clean['peptide'].str.count("57.02")

# use a count function to enumerate the # of oxidized M's in each peptide
time0db_clean['m-oxid'] = time0db_clean['peptide'].apply(lambda x: x.count('M(+15.99)'))

# use a count function to enumerate the # of oxidized K's in each peptide
#time0db_clean['k-oxid'] = time0db_clean['peptide'].apply(lambda x: x.count('K(+15.99)'))

# use a count function to enumerate the # of oxidized P's in each peptide
#time0db_clean['p-oxid'] = time0db_clean['peptide'].apply(lambda x: x.count('P(+15.99)'))

# use a count function to enumerate the # of oxidized R's in each peptide
#time0db_clean['r-oxid'] = time0db_clean['peptide'].apply(lambda x: x.count('R(+15.99)'))

# use a count function to enumerate the # of oxidized Y's in each peptide
#time0db_clean['y-oxid'] = time0db_clean['peptide'].apply(lambda x: x.count('Y(+15.99)'))

# use a lamba function to enumerate the # of deamidated N's in each peptide
time0db_clean['n-deam'] = time0db_clean['peptide'].apply(lambda x: x.count('N(+.98)'))

# use a lamba function to enumerate the # of deamidated N's in each peptide
time0db_clean['q-deam'] = time0db_clean['peptide'].apply(lambda x: x.count('Q(+.98)'))

# use a count function to enumerate the # of methylated K's in each peptide
#time0db_clean['k-meth'] = time0db_clean['peptide'].apply(lambda x: x.count('K(+14.02)'))

# use a count function to enumerate the # of methylated R's in each peptide
#time0db_clean['r-meth'] = time0db_clean['peptide'].apply(lambda x: x.count('R(+14.02)'))

# use a count function to enumerate the # of pyro glu Q's in each peptide
#time0db_clean['q-pyro'] = time0db_clean['peptide'].apply(lambda x: x.count('Q(-17.03)'))

# use a count function to enumerate the # of acetylation of K's in each peptide
#time0db_clean['k-acet'] = time0db_clean['peptide'].apply(lambda x: x.count('K(+42.01)'))

# create a column with 'stripped' peptide sequences using strip
time0db_clean['stripped_peptide'] = time0db_clean['peptide'].str.replace(r"\(.*\)","")

# add a column with the stripped peptide length (number of AAs)
time0db_clean['stripped_length'] = time0db_clean['stripped_peptide'].apply(len)

##time0db_clean['NAAF_num.'] = time0db_clean['Area'] / time0db_clean['stripped_length']

# total the number of modifications in sequence
time0db_clean['ptm-total'] = time0db_clean['c-carb'] + time0db_clean['m-oxid'] + time0db_clean['n-deam'] + time0db_clean['q-deam'] 

# turn all isoleucines into leucines
# this helps later in comparing Unipept peptides to PeaksDB and Comet ones
time0db_clean['stripped_IL']= time0db_clean['stripped_peptide'].str.replace('I','L')

# write modified dataframe to new txt file
time0db_clean.to_csv("/home/millieginty/Documents/git-repos/amazon/data/processed/TROCAS7_Fusion_Apr2021-all-samples/Apr21-peaks76-DB-peptide-proc-time0.csv")

# check out the results
time0db_clean.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  time0db_clean['A'] = time0db_clean['peptide'].str.count("A")
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  time0db_clean['C'] = time0db_clean['peptide'].str.count("C")
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  time0db_clean['D'] = time0db_clean['peptide'].str.count("D")
A value is trying to b

Unnamed: 0_level_0,Area Trocas7-102-Bay,Area Trocas7-106-Chav,Area Trocas7-206-Chav,Area Trocas7-110-SMCP,Area Trocas7-126-NMCP,Area Trocas7-202-Bay,Area Trocas7-210-SMCP,Area Trocas7-226-NMCP,Area Trocas7-102-Bay-DUP,Area Trocas7-106-Chav-DUP,...,W,Y,c-carb,m-oxid,n-deam,q-deam,stripped_peptide,stripped_length,ptm-total,stripped_IL
Peptide,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
LGEHNIDVLEGNEQFINAAK,21900000.0,14100000.0,24800000.0,3990000.0,3670000.0,12900000.0,2850000.0,3650000.0,13300000.0,7910000.0,...,0,0,0,0,0,0,LGEHNIDVLEGNEQFINAAK,20,0,LGEHNLDVLEGNEQFLNAAK
SC(+57.02)AAAGTEC(+57.02)LISGWGNTK,3960000.0,4480000.0,2320000.0,881000.0,1950000.0,1120000.0,2010000.0,1380000.0,5310000.0,5150000.0,...,1,0,2,0,0,0,SCLISGWGNTK,11,2,SCLLSGWGNTK
SSGSSYPSLLQC(+57.02)LK,637000.0,900000.0,627000.0,182000.0,332000.0,241000.0,517000.0,300000.0,837000.0,1040000.0,...,0,1,1,0,0,0,SSGSSYPSLLQCLK,14,1,SSGSSYPSLLQCLK
SGGGGGGGLGSGGSIR,,36900.0,,,,,,,,,...,0,0,0,0,0,0,SGGGGGGGLGSGGSIR,16,0,SGGGGGGGLGSGGSLR
LGEHNIDVLEGNEQFIN(+.98)AAK,337000.0,216000.0,259000.0,,44800.0,96800.0,,,172000.0,101000.0,...,0,0,0,0,1,0,LGEHNIDVLEGNEQFINAAK,20,1,LGEHNLDVLEGNEQFLNAAK


In [32]:
# Now, clean up the T0 and T24 peptides and get their stripped/AA

# use a count function to enumerate the # of A's (alanines) in each peptide
time24db_clean['A'] = time24db_clean['peptide'].str.count("A")

# use a count function to enumerate the # of C's (cysteines) in each peptide
time24db_clean['C'] = time24db_clean['peptide'].str.count("C")

# use a count function to enumerate the # of D's (aspartic acids) in each peptide
time24db_clean['D'] = time24db_clean['peptide'].str.count("D")

# use a count function to enumerate the # of E's (glutamic acids) in each peptide
time24db_clean['E'] = time24db_clean['peptide'].str.count("E")

# use a count function to enumerate the # of F's (phenylalanines) in each peptide
time24db_clean['F'] = time24db_clean['peptide'].str.count("F")

# use a count function to enumerate the # of G's (glycines) in each peptide
time24db_clean['G'] = time24db_clean['peptide'].str.count("G")

# use a count function to enumerate the # of H's (histidines) in each peptide
time24db_clean['H'] = time24db_clean['peptide'].str.count("H")

# use a count function to enumerate the # of I's (isoleucines) in each peptide
# in time24db_clean output, there will be no isoleucines (they're lumped in with leucines)
time24db_clean['I'] = time24db_clean['peptide'].str.count("I")

# use a count function to enumerate the # of K's (lysines) in each peptide
time24db_clean['K'] = time24db_clean['peptide'].str.count("K")

# use a count function to enumerate the # of L's (leucines) in each peptide
# also these include the isoleucines
time24db_clean['L'] = time24db_clean['peptide'].str.count("L")

# use a count function to enumerate the # of M's (methionines) in each peptide
time24db_clean['M'] = time24db_clean['peptide'].str.count("M")

# use a count function to enumerate the # of N's (asparagines) in each peptide
time24db_clean['N'] = time24db_clean['peptide'].str.count("N")

# use a count function to enumerate the # of P's ([prolines]) in each peptide
time24db_clean['P'] = time24db_clean['peptide'].str.count("P")

# use a count function to enumerate the # of Q's (glutamines) in each peptide
time24db_clean['Q'] = time24db_clean['peptide'].str.count("Q")

# use a count function to enumerate the # of R's (arginines) in each peptide
time24db_clean['R'] = time24db_clean['peptide'].str.count("R")

# use a count function to enumerate the # of S's (serines) in each peptide
time24db_clean['S'] = time24db_clean['peptide'].str.count("S")

# use a count function to enumerate the # of T's (threonines) in each peptide
time24db_clean['T'] = time24db_clean['peptide'].str.count("T")

# use a count function to enumerate the # of V's (valines) in each peptide
time24db_clean['V'] = time24db_clean['peptide'].str.count("V")

# use a count function to enumerate the # of W's (tryptophans) in each peptide
time24db_clean['W'] = time24db_clean['peptide'].str.count("W")

# use a count function to enumerate the # of Y's (tyrosines) in each peptide
time24db_clean['Y'] = time24db_clean['peptide'].str.count("Y")

# use a count function to enumerate the # of carbamidomethylated C's in each peptide
time24db_clean['c-carb'] = time24db_clean['peptide'].str.count("57.02")

# use a count function to enumerate the # of oxidized M's in each peptide
time24db_clean['m-oxid'] = time24db_clean['peptide'].apply(lambda x: x.count('M(+15.99)'))

# use a count function to enumerate the # of oxidized K's in each peptide
#time24db_clean['k-oxid'] = time24db_clean['peptide'].apply(lambda x: x.count('K(+15.99)'))

# use a count function to enumerate the # of oxidized P's in each peptide
#time24db_clean['p-oxid'] = time24db_clean['peptide'].apply(lambda x: x.count('P(+15.99)'))

# use a count function to enumerate the # of oxidized R's in each peptide
#time24db_clean['r-oxid'] = time24db_clean['peptide'].apply(lambda x: x.count('R(+15.99)'))

# use a count function to enumerate the # of oxidized Y's in each peptide
#time24db_clean['y-oxid'] = time24db_clean['peptide'].apply(lambda x: x.count('Y(+15.99)'))

# use a lamba function to enumerate the # of deamidated N's in each peptide
time24db_clean['n-deam'] = time24db_clean['peptide'].apply(lambda x: x.count('N(+.98)'))

# use a lamba function to enumerate the # of deamidated N's in each peptide
time24db_clean['q-deam'] = time24db_clean['peptide'].apply(lambda x: x.count('Q(+.98)'))

# use a count function to enumerate the # of methylated K's in each peptide
#time24db_clean['k-meth'] = time24db_clean['peptide'].apply(lambda x: x.count('K(+14.02)'))

# use a count function to enumerate the # of methylated R's in each peptide
#time24db_clean['r-meth'] = time24db_clean['peptide'].apply(lambda x: x.count('R(+14.02)'))

# use a count function to enumerate the # of pyro glu Q's in each peptide
#time24db_clean['q-pyro'] = time24db_clean['peptide'].apply(lambda x: x.count('Q(-17.03)'))

# use a count function to enumerate the # of acetylation of K's in each peptide
#time24db_clean['k-acet'] = time24db_clean['peptide'].apply(lambda x: x.count('K(+42.01)'))

# create a column with 'stripped' peptide sequences using strip
time24db_clean['stripped_peptide'] = time24db_clean['peptide'].str.replace(r"\(.*\)","")

# add a column with the stripped peptide length (number of AAs)
time24db_clean['stripped_length'] = time24db_clean['stripped_peptide'].apply(len)

##time24db_clean['NAAF_num.'] = time24db_clean['Area'] / time24db_clean['stripped_length']

# total the number of modifications in sequence
time24db_clean['ptm-total'] = time24db_clean['c-carb'] + time24db_clean['m-oxid'] + time24db_clean['n-deam'] + time24db_clean['q-deam'] 

# turn all isoleucines into leucines
# this helps later in comparing Unipept peptides to PeaksDB and Comet ones
time24db_clean['stripped_IL']= time24db_clean['stripped_peptide'].str.replace('I','L')

# write modified dataframe to new txt file
time24db_clean.to_csv("/home/millieginty/Documents/git-repos/amazon/data/processed/TROCAS7_Fusion_Apr2021-all-samples/Apr21-peaks76-DB-peptide-proc-time24.csv")

# check out the results
time24db_clean.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  time24db_clean['A'] = time24db_clean['peptide'].str.count("A")
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  time24db_clean['C'] = time24db_clean['peptide'].str.count("C")
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  time24db_clean['D'] = time24db_clean['peptide'].str.count("D")
A value is tryin

Unnamed: 0_level_0,Area Trocas7-302-Bay,Area Trocas7-306-Chav,Area Trocas7-310-SMCP,Area Trocas7-318-NMCP,Area Trocas7-402-Bay,Area Trocas7-406-Chav,Area Trocas7-410-SMCP,Area Trocas7-417-NMCP,Area Trocas7-410-SMCP-DUP,Area Trocas7-303-Bay,...,W,Y,c-carb,m-oxid,n-deam,q-deam,stripped_peptide,stripped_length,ptm-total,stripped_IL
Peptide,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
LGEHNIDVLEGNEQFINAAK,4370000.0,5920000.0,247000.0,864000.0,858000.0,634000.0,334000.0,,25800.0,3330000.0,...,0,0,0,0,0,0,LGEHNIDVLEGNEQFINAAK,20,0,LGEHNLDVLEGNEQFLNAAK
SC(+57.02)AAAGTEC(+57.02)LISGWGNTK,3330000.0,3860000.0,1730000.0,7690000.0,749000.0,822000.0,921000.0,434000.0,1520000.0,5880000.0,...,1,0,2,0,0,0,SCLISGWGNTK,11,2,SCLLSGWGNTK
SSGSSYPSLLQC(+57.02)LK,371000.0,1150000.0,367000.0,1510000.0,159000.0,156000.0,186000.0,114000.0,340000.0,379000.0,...,0,1,1,0,0,0,SSGSSYPSLLQCLK,14,1,SSGSSYPSLLQCLK
SGGGGGGGLGSGGSIR,,,,,,,,,,532000.0,...,0,0,0,0,0,0,SGGGGGGGLGSGGSIR,16,0,SGGGGGGGLGSGGSLR
RHPYFYAPELLFFAKR,,,,,,,,,,47900.0,...,0,2,0,0,0,0,RHPYFYAPELLFFAKR,16,0,RHPYFYAPELLFFAKR


In [None]:
# write modified dataframe to new txt file