## Evaluating the PTMs across peptides from different cellular compartments

### Beginning with:

    Exported peptide lists (.csv files) that contain the amino acids with their modifications. We want to combine peptides from the following:
    
     - from trypsin and no-digest searches
     - from database (DB) and de novo (DN) searches
     
    From 8 samples (4 timepoints and 2 treatments: trypsin-digested and naturally digested):
    
     - 322: Day 0 trypsin digested
     - 323: Day 2 trypsin digested
     - 324: Day 5 trypsin digested
     - 325: Day 12 trypsin digested
     - 329: Day 0 undigested
     - 330: Day 2 undigested
     - 331: Day 5 undigested
     - 332: Day 12 undigested
    
### Want:

    Text files with all the stripped (no mod) peptides for the following modifications:
        
        - Lysine acetylation
        - Asparagine deamidation
        - Arginine methylation
        

In [1]:
# LIBRARIES
import os
# regular expressions (regex)
import re

# pandas for working with tabular data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import kde

# check pandas version
pd.__version__

'1.0.5'

## Combine 324

In [2]:
cd /home/millieginty/Documents/git-repos/rot-mayer/data/processed/PTM-cellular-compartment/to-combine/T5/

/home/millieginty/Documents/git-repos/rot-mayer/data/processed/PTM-cellular-compartment/to-combine/T5


In [3]:
cat TW_324_T5_trypsin_noenz_combine_PTMopt_DB_FDR1_mod_peptides.txt TW_324_T5_trypsin_combine_PTMopt_DB_FDR1_mod_peptides.txt TW_324_T5_trypsin_combine_PTMopt_DN_mod_peptides.txt TW_324_T5_trypsin_noenz_combine_PTMopt_DN_mod_peptides.txt > all_324.csv
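
The same concatenation can also be done inside pandas, which makes it possible to record which search each peptide came from. This is a minimal sketch, assuming each exported file holds one peptide per line with no header row; the function name `combine_peptide_files` and the `source` column are illustrative and not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def combine_peptide_files(paths):
    """Stack one-peptide-per-line text files into a single DataFrame."""
    frames = []
    for path in paths:
        # header=None: the files are assumed to have no header row.
        # keep_default_na=False: stops a sequence such as "NA" being read as missing.
        df = pd.read_csv(path, header=None, names=["Peptide"], keep_default_na=False)
        # Hypothetical provenance column: the file each peptide was read from.
        df["source"] = Path(path).name
        frames.append(df)
    return pd.concat(frames, ignore_index=True)
```

Calling `combine_peptide_files([...])` with the four file names from the `cat` command would give a frame with the same rows as `all_324.csv`, plus the extra `source` column.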

In [4]:
# read in the combined datafile as a dataframe

all_324 = pd.read_csv("all_324.csv", header = None)

all_324.columns = ['Peptide']

print('Total peptides:', len(all_324))

all_324.head()

Total peptides: 7765


Unnamed: 0,Peptide
0,VIGQNEAVDAVSNAIR
1,LPQVEGTGGDVQPSQDLVR
2,YPVFAQQGYSNPR
3,AGIHLPGSINYAGD
4,AIGPGIGQGNAAGQAVEGIAR


In [5]:
# take all lines that contain deamidated asparagines and make a new df

keep = [r"N\(\+\.98"]

# .copy() so that adding a column below does not raise SettingWithCopyWarning
N_deam_324 = all_324[all_324.Peptide.str.contains('|'.join(keep))].copy()

# now strip the modification tags, one parenthesised group at a time;
# repeated stripped sequences are left as NaN and dropped below

N_deam_324['stripped_peptide'] = N_deam_324['Peptide'].str.replace(r"\([^)]*\)", "", regex=True).drop_duplicates()

print('Number of deamidated asparagine peptides:', len(N_deam_324))

# keep only stripped peptide column
ndeam_324_sp = N_deam_324[["stripped_peptide"]].dropna()

# write to txt file

ndeam_324_sp.to_csv('324-T5-combined-n-deam-stripped-peptides.txt', header=False, index=False)

N_deam_324.head()

Number of deamidated asparagine peptides: 1016


Unnamed: 0,Peptide,stripped_peptide
158,NN(+.98)PVLIGEPGVGK,NNPVLIGEPGVGK
238,LGEHN(+.98)IDVLEGN,LGEHNIDVLEGN
250,N(+.98)NPVLIGEPGVGK,
284,LGEHNIDVLEGN(+.98),
495,EGTN(+.98)DIVLE,EGTNDIVLE
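
Because a peptide can carry several modification tags, the pattern used to strip them has to match one parenthesised group at a time. A minimal sketch with the standard `re` module shows the difference between a greedy and a per-tag pattern; `strip_mods` is an illustrative name, not something defined elsewhere in this notebook.

```python
import re

# Matches a single parenthesised tag such as "(+.98)" or "(+42.01)".
MOD_TAG = re.compile(r"\([^)]*\)")


def strip_mods(peptide):
    """Remove every modification tag and keep all residues."""
    return MOD_TAG.sub("", peptide)


peptide = "R(+14.02)PGC(+57.02)TPK"

# Greedy: ".*" runs from the first "(" to the last ")" and swallows "PGC".
greedy = re.sub(r"\(.*\)", "", peptide)
print(greedy)               # RTPK

# Per-tag: only the tags themselves are removed.
print(strip_mods(peptide))  # RPGCTPK
```

For peptides with a single tag both patterns give the same result, so the difference only shows up for multiply modified sequences.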


In [6]:
# take all lines that contain lysine acetylations and make a new df

keep = [r"K\(\+42\.01"]

# .copy() so that adding a column below does not raise SettingWithCopyWarning
K_acet_324 = all_324[all_324.Peptide.str.contains('|'.join(keep))].copy()

# now strip the modification tags, one parenthesised group at a time;
# repeated stripped sequences are left as NaN and dropped below

K_acet_324['stripped_peptide'] = K_acet_324['Peptide'].str.replace(r"\([^)]*\)", "", regex=True).drop_duplicates()

print('Number of lysine acetylation peptides:', len(K_acet_324))

# keep only stripped peptide column
kacet_324_sp = K_acet_324[["stripped_peptide"]].dropna()

# write to txt file

kacet_324_sp.to_csv('324-T5-combined-k-acet-stripped-peptides.txt', header=False, index=False)

K_acet_324.head()

Number of lysine acetylation peptides: 328


Unnamed: 0,Peptide,stripped_peptide
872,NGALDFGWDK(+15.99)FDAETK(+42.01),NGALDFGWDKFDAETK
1185,NGALDFGWDK(+42.01)FDAETK(+15.99),
1289,K(+42.01)LLLPK,KLLLPK
1340,K(+42.01)PEEVVK,KPEEVVK
1343,K(+42.01)ETSFAK,KETSFAK


In [7]:
# take all lines that contain methylated arginines and make a new df

keep = [r"R\(\+14\.02"]

# .copy() so that adding a column below does not raise SettingWithCopyWarning
R_meth_324 = all_324[all_324.Peptide.str.contains('|'.join(keep))].copy()

# now strip the modification tags, one parenthesised group at a time;
# repeated stripped sequences are left as NaN and dropped below

R_meth_324['stripped_peptide'] = R_meth_324['Peptide'].str.replace(r"\([^)]*\)", "", regex=True).drop_duplicates()

print('Number of methylated arginine peptides:', len(R_meth_324))

# keep only stripped peptide column
rmeth_324_sp = R_meth_324[["stripped_peptide"]].dropna()

# write to txt file

rmeth_324_sp.to_csv('324-T5-combined-r-meth-stripped-peptides.txt', header=False, index=False)

R_meth_324.head()

Number of methylated arginine peptides: 427


Unnamed: 0,Peptide,stripped_peptide
1594,WSLAR(+14.02)AR,WSLARAR
1629,R(+14.02)PGC(+57.02)TPK,RPGCTPK
1642,R(+14.02)LDN(+.98)ADC(+57.02)PR,RLDNADCPR
1666,LLEDN(+.98)R(+14.02),LLEDNR
1720,R(+14.02)DATENPR,RDATENPR
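
The three cells for sample 324 follow the same filter, strip and de-duplicate steps, so they could be folded into one helper. The sketch below is one way to do that, assuming a DataFrame with a `Peptide` column as built above; `stripped_peptides_for` is a hypothetical name and the demo rows are a handful of peptides copied from the outputs in this notebook.

```python
import pandas as pd


def stripped_peptides_for(df, tag_pattern):
    """Return (number of modified peptides, unique stripped sequences)."""
    # Keep only peptides carrying the requested modification tag.
    hits = df.loc[df["Peptide"].str.contains(tag_pattern, regex=True), "Peptide"]
    # Remove every parenthesised tag, one group at a time.
    stripped = hits.str.replace(r"\([^)]*\)", "", regex=True)
    return len(hits), stripped.drop_duplicates().reset_index(drop=True)


demo = pd.DataFrame({"Peptide": [
    "NN(+.98)PVLIGEPGVGK",
    "N(+.98)NPVLIGEPGVGK",
    "K(+42.01)LLLPK",
    "WSLAR(+14.02)AR",
]})

n_modified, sequences = stripped_peptides_for(demo, r"N\(\+\.98")
print(n_modified, list(sequences))  # 2 ['NNPVLIGEPGVGK']
```

Returning the de-duplicated Series directly avoids assigning a shortened column back onto the filtered frame, so no NaN placeholders or `dropna()` step are needed.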


## Combine 331

In [8]:
cat TW_331_T5_trypsin_noenz_combine_PTMopt_DN_mod_peptides.txt TW_331_T5_undig_noenz_combine_PTMopt_DB_FDR1_mod_peptides.txt > all_331.csv

In [9]:
# read in the combined datafile as a dataframe

all_331 = pd.read_csv("all_331.csv", header = None)

all_331.columns = ['Peptide']

print('Total peptides:', len(all_331))

all_331.head()

Total peptides: 2326


Unnamed: 0,Peptide
0,KFDEETK
1,FGGN(+.98)VLEVNK
2,DPN(+.98)LPLK(+42.01)H
3,VYLHPFHL
4,DDLPVVK


In [10]:
# take all lines that contain deamidated asparagines and make a new df

keep = [r"N\(\+\.98"]

# .copy() so that adding a column below does not raise SettingWithCopyWarning
N_deam_331 = all_331[all_331.Peptide.str.contains('|'.join(keep))].copy()

# now strip the modification tags, one parenthesised group at a time;
# repeated stripped sequences are left as NaN and dropped below

N_deam_331['stripped_peptide'] = N_deam_331['Peptide'].str.replace(r"\([^)]*\)", "", regex=True).drop_duplicates()

print('Number of deamidated asparagine peptides:', len(N_deam_331))

# keep only stripped peptide column
ndeam_331_sp = N_deam_331[["stripped_peptide"]].dropna()

# write to txt file

ndeam_331_sp.to_csv('331-T5-combined-n-deam-stripped-peptides.txt', header=False, index=False)

N_deam_331.head()

Number of deamidated asparagine peptides: 390


Unnamed: 0,Peptide,stripped_peptide
1,FGGN(+.98)VLEVNK,FGGNVLEVNK
2,DPN(+.98)LPLK(+42.01)H,DPNLPLKH
5,TEFN(+.98)VLLK,TEFNVLLK
6,EGN(+.98)VLEVR,EGNVLEVR
8,N(+.98)LERLE,NLERLE


In [11]:
# take all lines that contain lysine acetylations and make a new df

keep = [r"K\(\+42\.01"]

# .copy() so that adding a column below does not raise SettingWithCopyWarning
K_acet_331 = all_331[all_331.Peptide.str.contains('|'.join(keep))].copy()

# now strip the modification tags, one parenthesised group at a time;
# repeated stripped sequences are left as NaN and dropped below

K_acet_331['stripped_peptide'] = K_acet_331['Peptide'].str.replace(r"\([^)]*\)", "", regex=True).drop_duplicates()

print('Number of lysine acetylation peptides:', len(K_acet_331))

# keep only stripped peptide column
kacet_331_sp = K_acet_331[["stripped_peptide"]].dropna()

# write to txt file

kacet_331_sp.to_csv('331-T5-combined-k-acet-stripped-peptides.txt', header=False, index=False)

K_acet_331.head()

Number of lysine acetylation peptides: 122


Unnamed: 0,Peptide,stripped_peptide
2,DPN(+.98)LPLK(+42.01)H,DPNLPLKH
7,DPDLPLK(+42.01)H,DPDLPLKH
16,VGK(+42.01)LLLPK,VGKLLLPK
22,K(+42.01)YDPDLPLLGH,KYDPDLPLLGH
26,K(+42.01)LLLPK,KLLLPK


In [12]:
# take all lines that contain methylated arginines and make a new df

keep = [r"R\(\+14\.02"]

# .copy() so that adding a column below does not raise SettingWithCopyWarning
R_meth_331 = all_331[all_331.Peptide.str.contains('|'.join(keep))].copy()

# now strip the modification tags, one parenthesised group at a time;
# repeated stripped sequences are left as NaN and dropped below

R_meth_331['stripped_peptide'] = R_meth_331['Peptide'].str.replace(r"\([^)]*\)", "", regex=True).drop_duplicates()

print('Number of methylated arginine peptides:', len(R_meth_331))

# keep only stripped peptide column
rmeth_331_sp = R_meth_331[["stripped_peptide"]].dropna()

# write to txt file

rmeth_331_sp.to_csv('331-T5-combined-r-meth-stripped-peptides.txt', header=False, index=False)

R_meth_331.head()

Number of methylated arginine peptides: 84


Unnamed: 0,Peptide,stripped_peptide
71,TN(+.98)ATR(+14.02)TT,TNATRTT
170,LVMR(+14.02)DNL,LVMRDNL
204,DSER(+14.02)MC(+57.02)PDDK,DSERMCPDDK
224,TR(+14.02)LHPPVP,TRLHPPVP
225,VGDFR(+14.02)P(+15.99)DL,VGDFRPDL
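
Both samples go through the same three modifications, so the per-sample cells could be driven by one loop that writes the text files and collects the counts in a single table. The following is a sketch under stated assumptions: the `MODS` dictionary, the function `write_stripped_lists` and its `timepoint` argument are hypothetical, and the file names simply follow the naming scheme used in the cells above.

```python
from pathlib import Path

import pandas as pd

# Tag patterns for the three modifications; the short labels reuse the
# file-name scheme from the cells above.
MODS = {
    "n-deam": r"N\(\+\.98",
    "k-acet": r"K\(\+42\.01",
    "r-meth": r"R\(\+14\.02",
}


def write_stripped_lists(samples, out_dir, timepoint="T5"):
    """Write one stripped-peptide file per sample and modification.

    samples: dict mapping a sample ID to a DataFrame with a 'Peptide' column.
    Returns a DataFrame with the counts for each sample and modification.
    """
    out_dir = Path(out_dir)
    rows = []
    for sample_id, df in samples.items():
        for label, pattern in MODS.items():
            hits = df.loc[df["Peptide"].str.contains(pattern, regex=True), "Peptide"]
            unique = hits.str.replace(r"\([^)]*\)", "", regex=True).drop_duplicates()
            name = f"{sample_id}-{timepoint}-combined-{label}-stripped-peptides.txt"
            unique.to_csv(out_dir / name, header=False, index=False)
            rows.append({"sample": sample_id, "mod": label,
                         "modified": len(hits), "unique_stripped": len(unique)})
    return pd.DataFrame(rows)
```

With the frames from this notebook the call would look like `write_stripped_lists({"324": all_324, "331": all_331}, ".")`, and the returned table puts the per-sample counts side by side.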
