## Evaluating the PTMs across peptides from different cellular compartments

### Beginning with:

    Exported peptide lists (.csv files) that contain the amino acids (AAs) with their modifications. Want to combine peptides from the following:
    
     - from trypsin and no-digest searches
     - from DB and DN searches
     
    From 8 samples (4 timepoints and 2 treatments: trypsin-digested and naturally digested, listed below as undigested):
    
     - 322: Day 0 trypsin digested
     - 323: Day 2 trypsin digested
     - 324: Day 5 trypsin digested
     - 325: Day 12 trypsin digested
     - 329: Day 0 undigested
     - 330: Day 2 undigested
     - 331: Day 5 undigested
     - 332: Day 12 undigested
    
### Want:

    Text files with all the stripped (no mod) peptides for the following modifications:
        
        - Lysine acetylation
        - Asparagine deamidation
        - Arginine methylation
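
The three target modifications can be written down once as a lookup table so that every sample is filtered the same way. The following is a minimal sketch, assuming the modifications appear in the exported peptide strings exactly as `N(+.98)`, `K(+42.01)` and `R(+14.02)`; the names `MOD_PATTERNS` and `has_modification` are hypothetical helpers for illustration, not part of the notebook.

```python
import pandas as pd

# Hypothetical lookup table: one regex per target modification.
# The residue letter comes first, followed by the mass shift in parentheses.
MOD_PATTERNS = {
    "n-deam": r"N\(\+\.98\)",    # asparagine deamidation
    "k-acet": r"K\(\+42\.01\)",  # lysine acetylation
    "r-meth": r"R\(\+14\.02\)",  # arginine methylation
}

def has_modification(peptides: pd.Series, mod: str) -> pd.Series:
    """Return a boolean mask marking peptides that carry the given modification."""
    return peptides.str.contains(MOD_PATTERNS[mod], regex=True)

# Small stand-in list of peptide strings taken from the tables in this notebook.
example = pd.Series([
    "DTNN(+.98)GNVWAPLLK",
    "K(+42.01)N(+.98)ELSEEDR",
    "MAFSHHSR(+14.02)",
    "AIDLIDEAASSIR",
])

for mod in MOD_PATTERNS:
    print(mod, int(has_modification(example, mod).sum()))
```

A peptide with several modifications is counted once under each modification it carries, which matches how the per-modification files are built here.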
        

In [1]:
# LIBRARIES
import os
os.getcwd()
# pandas for working with tabular data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import kde
# regular expressions (regex)
import re
# check pandas version
pd.__version__

'1.0.5'

## Combine 323

In [2]:
cd /home/millieginty/Documents/git-repos/rot-mayer/data/processed/PTM-cellular-compartment/to-combine/T2/

/home/millieginty/Documents/git-repos/rot-mayer/data/processed/PTM-cellular-compartment/to-combine/T2


In [3]:
cat TW_323_T2_trypsin_noenz_combine_PTMopt_DB_FDR1_mod_peptides.txt TW_323_T2_trypsin_combine_PTMopt_DB_FDR1_mod_peptides.txt TW_323_T2_trypsin_combine_PTMopt_DN_mod_peptides.txt TW_323_T2_trypsin_noenz_combine_PTMopt_DN_mod_peptides.txt > all_323.csv
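
The `cat` call depends on a shell being available. The same combination can be sketched in pandas, assuming each input file holds one peptide per line with no header row. The function name `combine_peptide_files` and the two stand-in file names are hypothetical; they are written to a temporary directory only so the example runs on its own.

```python
import tempfile
from pathlib import Path

import pandas as pd

def combine_peptide_files(paths):
    """Stack one-column peptide files into one DataFrame with a 'Peptide' column."""
    frames = [pd.read_csv(p, header=None, names=["Peptide"]) for p in paths]
    return pd.concat(frames, ignore_index=True)

# Demonstration with two small stand-in files (hypothetical names and contents).
with tempfile.TemporaryDirectory() as tmp:
    file_a = Path(tmp) / "search_a_mod_peptides.txt"
    file_b = Path(tmp) / "search_b_mod_peptides.txt"
    file_a.write_text("LPQVEGTGGDVQPSQDLVR\nDTNN(+.98)GNVWAPLLK\n")
    file_b.write_text("TTEEK(+42.01)R\n")
    combined = combine_peptide_files([file_a, file_b])

print("Total peptides:", len(combined))
```

Reading the files directly also avoids writing an intermediate combined file to disk.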

In [4]:
# read in the combined datafile as a dataframe

all_323 = pd.read_csv("all_323.csv", header = None)

all_323.columns = ['Peptide']

print('Total peptides:', len(all_323))

all_323.head()

Total peptides: 11323


Unnamed: 0,Peptide
0,LPQVEGTGGDVQPSQDLVR
1,AAIGPGIGQGNAAGQAVEGIAR
2,VIGQNEAVDAVSNAIR
3,AIDLIDEAASSIR
4,GPAPLPLALAHLD


In [5]:
# take all rows that contain deamidated asparagines and make a new df

keep = [r"N\(\+\.98"]

# .copy() so that adding a column does not write to a slice of all_323
N_deam_323 = all_323[all_323.Peptide.str.contains('|'.join(keep))].copy()

# now strip the modification annotations, one "(...)" group at a time

N_deam_323['stripped_peptide'] = N_deam_323['Peptide'].str.replace(r"\([^)]*\)", "", regex=True)

print('Number of deamidated asparagine peptides:', len(N_deam_323))

# keep only the stripped peptide column, without duplicates
ndeam_323_sp = N_deam_323[["stripped_peptide"]].drop_duplicates()

# write to txt file

ndeam_323_sp.to_csv('323-T2-combined-n-deam-stripped-peptides.txt', header=False, index=False)

N_deam_323.head()

Number of deamidated asparagine peptides: 1440


Unnamed: 0,Peptide,stripped_peptide
37,DTNN(+.98)GNVWAPLLK,DTNNGNVWAPLLK
64,FDN(+.98)TTTVVELAK,FDNTTTVVELAK
139,SNGDGVIDIN(+.98)DK,SNGDGVIDINDK
235,N(+.98)NPVLIGEPGVGK,NNPVLIGEPGVGK
341,TGEWVYLN(+.98)EFGQR,TGEWVYLNEFGQR


In [6]:
# take all rows that contain lysine acetylations and make a new df

keep = [r"K\(\+42\.01"]

# .copy() so that adding a column does not write to a slice of all_323
K_acet_323 = all_323[all_323.Peptide.str.contains('|'.join(keep))].copy()

# now strip the modification annotations, one "(...)" group at a time

K_acet_323['stripped_peptide'] = K_acet_323['Peptide'].str.replace(r"\([^)]*\)", "", regex=True)

print('Number of lysine acetylation peptides:', len(K_acet_323))

# keep only the stripped peptide column, without duplicates
kacet_323_sp = K_acet_323[["stripped_peptide"]].drop_duplicates()

# write to txt file

kacet_323_sp.to_csv('323-T2-combined-k-acet-stripped-peptides.txt', header=False, index=False)

K_acet_323.head()

Number of lysine acetylation peptides: 453


Unnamed: 0,Peptide,stripped_peptide
2203,TTEEK(+42.01)R,TTEEKR
2226,K(+42.01)KNDEELNK,KKNDEELNK
2336,K(+42.01)N(+.98)ELSEEDR,KNELSEEDR
2337,LLNEK(+42.01)R,LLNEKR
2345,K(+42.01)VDEEETK,KVDEEETK


In [7]:
# take all rows that contain methylated arginines and make a new df

keep = [r"R\(\+14\.02"]

# .copy() so that adding a column does not write to a slice of all_323
R_meth_323 = all_323[all_323.Peptide.str.contains('|'.join(keep))].copy()

# now strip the modification annotations, one "(...)" group at a time

R_meth_323['stripped_peptide'] = R_meth_323['Peptide'].str.replace(r"\([^)]*\)", "", regex=True)

print('Number of methylated arginine peptides:', len(R_meth_323))

# keep only the stripped peptide column, without duplicates
rmeth_323_sp = R_meth_323[["stripped_peptide"]].drop_duplicates()

# write to txt file

rmeth_323_sp.to_csv('323-T2-combined-r-meth-stripped-peptides.txt', header=False, index=False)

R_meth_323.head()

Number of methylated arginine peptides: 558


Unnamed: 0,Peptide,stripped_peptide
1983,MAFSHHSR(+14.02),MAFSHHSR
2303,WEELFR(+14.02),WEELFR
2405,TGVEAWGDR(+14.02),TGVEAWGDR
2433,R(+14.02)C(+57.02)ASSTGR,RCASSTGR
2509,R(+14.02)LDN(+.98)ADC(+57.02)PR,RLDNADCPR
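
Some peptides in the tables above carry more than one modification, so how the `(...)` groups are removed matters. The sketch below is a minimal illustration, assuming annotations are always non-nested parenthesised mass shifts. A greedy `\(.*\)` spans from the first `(` to the last `)` and swallows the residues in between, whereas `\([^)]*\)` removes each group separately and keeps the full amino acid sequence.

```python
import pandas as pd

# Two peptides from the arginine methylation table: one with three
# modifications and one with a single modification.
peptides = pd.Series(["R(+14.02)LDN(+.98)ADC(+57.02)PR", "WEELFR(+14.02)"])

# Greedy: one match from the first "(" to the last ")" in the string.
greedy = peptides.str.replace(r"\(.*\)", "", regex=True)

# Per group: "[^)]*" cannot cross a ")", so each annotation is a separate match.
per_group = peptides.str.replace(r"\([^)]*\)", "", regex=True)

print(greedy.tolist())     # ['RPR', 'WEELFR']
print(per_group.tolist())  # ['RLDNADCPR', 'WEELFR']
```

The two patterns agree for peptides with a single modification and differ only when several annotations occur in one string.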


## Combine 330

In [99]:
cat TW_330_T2_trypsin_noenz_combine_PTMopt_DN_mod_peptides.txt TW_330_T2_undig_noenz_combine_PTMopt_DB_FDR1_mod_peptides.txt > all_330.csv

In [100]:
# read in the combined datafile as a dataframe

all_330 = pd.read_csv("all_330.csv", header = None)

all_330.columns = ['Peptide']

print('Total peptides:', len(all_330))

all_330.head()

Total peptides: 2118


Unnamed: 0,Peptide
0,HLDVDDSGK
1,DKFDEETK
2,DPN(+.98)LPLK(+42.01)H
3,P(+15.99)KEKFE
4,DHGEVVVK


In [101]:
# take all rows that contain deamidated asparagines and make a new df

keep = [r"N\(\+\.98"]

# .copy() so that adding a column does not write to a slice of all_330
N_deam_330 = all_330[all_330.Peptide.str.contains('|'.join(keep))].copy()

# now strip the modification annotations, one "(...)" group at a time

N_deam_330['stripped_peptide'] = N_deam_330['Peptide'].str.replace(r"\([^)]*\)", "", regex=True)

print('Number of deamidated asparagine peptides:', len(N_deam_330))

# keep only the stripped peptide column, without duplicates
ndeam_330_sp = N_deam_330[["stripped_peptide"]].drop_duplicates()

# write to txt file

ndeam_330_sp.to_csv('330-T2-combined-n-deam-stripped-peptides.txt', header=False, index=False)

N_deam_330.head()

Number of deamidated asparagine peptides: 364


Unnamed: 0,Peptide,stripped_peptide
2,DPN(+.98)LPLK(+42.01)H,DPNLPLKH
7,SPN(+.98)N(+.98)SLK,SPNNSLK
8,YN(+.98)PDLPLLGH,YNPDLPLLGH
11,DN(+.98)ADQERF,DNADQERF
12,DPN(+.98)LPLVAH,DPNLPLVAH


In [86]:
# take all rows that contain lysine acetylations and make a new df

keep = [r"K\(\+42\.01"]

# .copy() so that adding a column does not write to a slice of all_330
K_acet_330 = all_330[all_330.Peptide.str.contains('|'.join(keep))].copy()

# now strip the modification annotations, one "(...)" group at a time

K_acet_330['stripped_peptide'] = K_acet_330['Peptide'].str.replace(r"\([^)]*\)", "", regex=True)

print('Number of lysine acetylation peptides:', len(K_acet_330))

# keep only the stripped peptide column, without duplicates
kacet_330_sp = K_acet_330[["stripped_peptide"]].drop_duplicates()

# write to txt file

kacet_330_sp.to_csv('330-T2-combined-k-acet-stripped-peptides.txt', header=False, index=False)

K_acet_330.head()

Number of lysine acetylation peptides: 90


Unnamed: 0,Peptide,stripped_peptide
2,DPN(+.98)LPLK(+42.01)H,DPNLPLKH
17,WLVK(+42.01)LP,WLVKLP
81,TN(+.98)QQLSK(+42.01),TNQQLSK
146,K(+42.01)PLFDLKDR(+15.99)P,KPLFDLKDRP
154,K(+42.01)YDPDLPLLGH,KYDPDLPLLGH


In [87]:
# take all rows that contain methylated arginines and make a new df

keep = [r"R\(\+14\.02"]

# .copy() so that adding a column does not write to a slice of all_330
R_meth_330 = all_330[all_330.Peptide.str.contains('|'.join(keep))].copy()

# now strip the modification annotations, one "(...)" group at a time

R_meth_330['stripped_peptide'] = R_meth_330['Peptide'].str.replace(r"\([^)]*\)", "", regex=True)

print('Number of methylated arginine peptides:', len(R_meth_330))

# keep only the stripped peptide column, without duplicates
rmeth_330_sp = R_meth_330[["stripped_peptide"]].drop_duplicates()

# write to txt file

rmeth_330_sp.to_csv('330-T2-combined-r-meth-stripped-peptides.txt', header=False, index=False)

R_meth_330.head()

Number of methylated arginine peptides: 75


Unnamed: 0,Peptide,stripped_peptide
36,LGELVR(+14.02)P,LGELVRP
83,LLVVC(+57.02)R(+14.02)TP(+15.99),LLVVCRTP
102,R(+14.02)PWTQT,RPWTQT
104,R(+14.02)KPSDPEE,RKPSDPEE
145,VLGADDR(+14.02)N,VLGADDRN
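
Because samples 323 and 330 differ greatly in their total number of peptides, the printed counts are easier to compare as fractions of each sample's total. The sketch below uses only the counts reported in the cell outputs above. These counts are rows in the combined files and may include repeated sequences, so the fractions are a rough comparison rather than a final result; the column names are illustrative.

```python
import pandas as pd

# Counts copied from the printed cell outputs for samples 323 and 330.
counts = pd.DataFrame(
    {
        "total": [11323, 2118],
        "n_deam": [1440, 364],
        "k_acet": [453, 90],
        "r_meth": [558, 75],
    },
    index=["323", "330"],
)

# Divide each modification count by that sample's total (row-wise alignment).
fractions = counts[["n_deam", "k_acet", "r_meth"]].div(counts["total"], axis=0)

print(fractions.round(3))
```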
