## Evaluating the PTMs across peptides from different cellular compartments

### Beginning with:

    Exported peptide lists (.csv files) that contain the AAs with modifications. We want to combine peptides from the following:
    
     - trypsin and no-digest searches
     - DB and DN searches
     
    From 8 samples (4 timepoints and 2 treatments: trypsin-digested and naturally digested):
    
     - 325: Day 0 trypsin digested
     - 323: Day 2 trypsin digested
     - 324: Day 5 trypsin digested
     - 325: Day 12 trypsin digested
     - 332: Day 0 undigested
     - 330: Day 2 undigested
     - 331: Day 5 undigested
     - 332: Day 12 undigested
    
### Want:

    Text files with all the stripped (no mod) peptides for the following modifications:
        
        - Lysine acetylation
        - Asparagine deamidation
        - Arginine methylation
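
The goal listed above can be sketched as one small helper. This is a minimal sketch, not the notebook's own code: the names `MOD_PATTERNS` and `stripped_peptides_with` are hypothetical, and the mass tags (`+.98`, `+42.01`, `+14.02`) are the ones used in the cells that follow.

```python
import pandas as pd

# Hypothetical lookup: residue plus mass tag as written in the exported peptide strings.
MOD_PATTERNS = {
    "n-deam": r"N\(\+\.98\)",    # asparagine deamidation
    "k-acet": r"K\(\+42\.01\)",  # lysine acetylation
    "r-meth": r"R\(\+14\.02\)",  # arginine methylation
}

# Matches a single parenthesised tag such as "(+15.99)".
TAG_PATTERN = r"\([^)]*\)"


def stripped_peptides_with(peptides, mod):
    """Return the unique unmodified sequences of the peptides that carry `mod`."""
    peptides = pd.Series(list(peptides), dtype="object")
    hits = peptides[peptides.str.contains(MOD_PATTERNS[mod], regex=True)]
    stripped = hits.str.replace(TAG_PATTERN, "", regex=True)
    return stripped.drop_duplicates().tolist()


example = [
    "LPQVEGTGGDVQPSQDLVR",
    "N(+.98)NPVLIGEPGVGK",
    "NN(+.98)PVLIGEPGVGK",
    "DPN(+.98)LPLK(+42.01)H",
]
n_deam = stripped_peptides_with(example, "n-deam")
k_acet = stripped_peptides_with(example, "k-acet")
```

Keeping the patterns in one dictionary means a further modification only needs one new entry rather than a new copy of the filtering code.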
        

In [1]:
# LIBRARIES
import os
os.getcwd()
# import pandas library for working with tabular data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import kde
# import regular expression (regex) library
import re
# check pandas version
pd.__version__

'1.0.5'

## Combine 325

In [2]:
cd /home/millieginty/Documents/git-repos/rot-mayer/data/processed/PTM-cellular-compartment/to-combine/T12/

/home/millieginty/Documents/git-repos/rot-mayer/data/processed/PTM-cellular-compartment/to-combine/T12


In [3]:
cat TW_325_T12_trypsin_noenz_combine_PTMopt_DB_FDR1_mod_peptides.txt TW_325_T12_trypsin_combine_PTMopt_DB_FDR1_mod_peptides.txt TW_325_T12_trypsin_combine_PTMopt_DN_mod_peptides.txt TW_325_T12_trypsin_noenz_combine_PTMopt_DN_mod_peptides.txt > all_325.csv
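
The same concatenation can also be done in pandas, which makes it possible to record which search each peptide came from. A minimal sketch, assuming each input file holds one peptide string per line with no header; the helper name `combine_peptide_files` and the `source` column are hypothetical and are not used elsewhere in this notebook.

```python
from pathlib import Path

import pandas as pd


def combine_peptide_files(paths):
    """Stack one-peptide-per-line text files into a single DataFrame."""
    frames = []
    for path in paths:
        path = Path(path)
        # keep_default_na=False stops a short sequence such as "NA" being read as missing
        df = pd.read_csv(path, header=None, names=["Peptide"], keep_default_na=False)
        df["source"] = path.name  # remember the originating search file
        frames.append(df)
    return pd.concat(frames, ignore_index=True)
```

Calling it with the four file names from the `cat` command would give one `Peptide` column, as in `all_325`, plus the extra `source` column.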

In [4]:
# read in the combined datafile as a dataframe

all_325 = pd.read_csv("all_325.csv", header = None)

all_325.columns = ['Peptide']

print('Total peptides:', len(all_325))

all_325.head()

Total peptides: 10571


Unnamed: 0,Peptide
0,LPQVEGTGGDVQPSQDLVR
1,STEFDNILIVGPIAGK
2,LPQVEGTGGDVQPSQ(+.98)DLVR
3,VIGQNEAVDAVSNAIR
4,AIGPGIGQGNAAGQAVEGIAR


In [5]:
# take all lines that contain a deamidated asparagine and make a new df

keep = [r"N\(\+\.98"]

# .copy() gives an independent DataFrame, so adding a column does not raise SettingWithCopyWarning
N_deam_325 = all_325[all_325.Peptide.str.contains('|'.join(keep))].copy()

# now strip the modification tags; [^)]* matches one "(...)" tag at a time,
# so peptides with several modifications keep the residues between the tags

N_deam_325['stripped_peptide'] = N_deam_325['Peptide'].str.replace(r"\([^)]*\)", "", regex=True).drop_duplicates()

print('Number of deamidated asparagine peptides:', len(N_deam_325))

# keep only the stripped peptide column; duplicates were left as NaN above and are dropped here
ndeam_325_sp = N_deam_325[["stripped_peptide"]].dropna()

# write to txt file

ndeam_325_sp.to_csv('325-T12-combined-n-deam-stripped-peptides.txt', header=False, index=False)

N_deam_325.head()

Number of deamidated asparagine peptides: 1490


Unnamed: 0,Peptide,stripped_peptide
25,VIGQN(+.98)EAVDAVSNAIR,VIGQNEAVDAVSNAIR
97,N(+.98)NPVLIGEPGVGK,NNPVLIGEPGVGK
104,LGEHNIDVLEGN(+.98),LGEHNIDVLEGN
177,NN(+.98)PVLIGEPGVGK,
228,LGEHN(+.98)IDVLEG,LGEHNIDVLEG
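
The empty `stripped_peptide` entry for row 177 comes from how pandas aligns on the index: `drop_duplicates()` returns a shorter Series, and assigning it back as a column leaves `NaN` wherever a duplicate was dropped. A small sketch with three of the peptides shown above (the variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    {"Peptide": ["N(+.98)NPVLIGEPGVGK", "LGEHNIDVLEGN(+.98)", "NN(+.98)PVLIGEPGVGK"]},
    index=[97, 104, 177],
)

stripped = df["Peptide"].str.replace(r"\([^)]*\)", "", regex=True)
unique = stripped.drop_duplicates()  # keeps rows 97 and 104, drops row 177

df["stripped_peptide"] = unique      # aligned on the index: row 177 becomes NaN
final = df["stripped_peptide"].dropna().tolist()
```

This is why the `dropna()` step is needed before the stripped peptides are written to the text file.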


In [6]:
# take all lines that contain an acetylated lysine and make a new df

keep = [r"K\(\+42\.01"]

# .copy() gives an independent DataFrame, so adding a column does not raise SettingWithCopyWarning
K_acet_325 = all_325[all_325.Peptide.str.contains('|'.join(keep))].copy()

# now strip the modification tags; [^)]* matches one "(...)" tag at a time,
# so peptides with several modifications keep the residues between the tags

K_acet_325['stripped_peptide'] = K_acet_325['Peptide'].str.replace(r"\([^)]*\)", "", regex=True).drop_duplicates()

print('Number of lysine acetylation peptides:', len(K_acet_325))

# keep only the stripped peptide column; duplicates were left as NaN above and are dropped here
kacet_325_sp = K_acet_325[["stripped_peptide"]].dropna()

# write to txt file

kacet_325_sp.to_csv('325-T12-combined-k-acet-stripped-peptides.txt', header=False, index=False)

K_acet_325.head()

Number of lysine acetylation peptides: 560


Unnamed: 0,Peptide,stripped_peptide
739,NGALDFGWDK(+15.99)FDAETK(+42.01),NGALDFGWDKFDAETK
923,TTEEK(+42.01)R,TTEEKR
948,K(+42.01)LLLPK,KLLLPK
962,K(+42.01)LFGTLTK,KLFGTLTK
975,FEEAK(+42.01)R,FEEAKR


In [7]:
# take all lines that contain a methylated arginine and make a new df

keep = [r"R\(\+14\.02"]

# .copy() gives an independent DataFrame, so adding a column does not raise SettingWithCopyWarning
R_meth_325 = all_325[all_325.Peptide.str.contains('|'.join(keep))].copy()

# now strip the modification tags; [^)]* matches one "(...)" tag at a time,
# so peptides with several modifications keep the residues between the tags

R_meth_325['stripped_peptide'] = R_meth_325['Peptide'].str.replace(r"\([^)]*\)", "", regex=True).drop_duplicates()

print('Number of methylated arginine peptides:', len(R_meth_325))

# keep only the stripped peptide column; duplicates were left as NaN above and are dropped here
rmeth_325_sp = R_meth_325[["stripped_peptide"]].dropna()

# write to txt file

rmeth_325_sp.to_csv('325-T12-combined-r-meth-stripped-peptides.txt', header=False, index=False)

R_meth_325.head()

Number of methylated arginine peptides: 722


Unnamed: 0,Peptide,stripped_peptide
1012,TAGTGDTVNR(+14.02),TAGTGDTVNR
1015,C(+57.02)VVDP(+15.99)R(+14.02)K,CVVDPRK
1272,R(+14.02)LDN(+.98)AN(+.98)C(+57.02)PR,RLDNANCPR
1298,R(+14.02)WLEVK,RWLEVK
1304,R(+14.02)TTP(+15.99)VLR,RTTPVLR
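
Peptides such as row 1015 carry several modification tags, so it matters how the tags are matched. The sketch below compares a greedy pattern, which removes everything from the first `(` to the last `)`, with a pattern that matches one tag at a time. It uses only Python's `re` module and a peptide string from the table above.

```python
import re

peptide = "C(+57.02)VVDP(+15.99)R(+14.02)K"

greedy = re.sub(r"\(.*\)", "", peptide)      # spans from the first "(" to the last ")"
per_tag = re.sub(r"\([^)]*\)", "", peptide)  # removes each "(...)" tag separately

print(greedy)   # CK
print(per_tag)  # CVVDPRK
```

Only the per-tag pattern returns the full unmodified sequence, which is what a stripped peptide list needs.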


## Combine 332

In [8]:
cat TW_332_T12_trypsin_noenz_combine_PTMopt_DN_mod_peptides.txt TW_332_T12_undig_noenz_combine_PTMopt_DB_FDR1_mod_peptides.txt > all_332.csv

In [9]:
# read in the combined datafile as a dataframe

all_332 = pd.read_csv("all_332.csv", header = None)

all_332.columns = ['Peptide']

print('Total peptides:', len(all_332))

all_332.head()

Total peptides: 2687


Unnamed: 0,Peptide
0,HEDWQLK
1,LQNLHLL
2,KR(+15.99)LVVEN
3,DPN(+.98)LPLK(+42.01)H
4,KDETTLVDGH


In [10]:
# take all lines that contain a deamidated asparagine and make a new df

keep = [r"N\(\+\.98"]

# .copy() gives an independent DataFrame, so adding a column does not raise SettingWithCopyWarning
N_deam_332 = all_332[all_332.Peptide.str.contains('|'.join(keep))].copy()

# now strip the modification tags; [^)]* matches one "(...)" tag at a time,
# so peptides with several modifications keep the residues between the tags

N_deam_332['stripped_peptide'] = N_deam_332['Peptide'].str.replace(r"\([^)]*\)", "", regex=True).drop_duplicates()

print('Number of deamidated asparagine peptides:', len(N_deam_332))

# keep only the stripped peptide column; duplicates were left as NaN above and are dropped here
ndeam_332_sp = N_deam_332[["stripped_peptide"]].dropna()

# write to txt file

ndeam_332_sp.to_csv('332-T12-combined-n-deam-stripped-peptides.txt', header=False, index=False)

N_deam_332.head()

Number of deamidated asparagine peptides: 451


Unnamed: 0,Peptide,stripped_peptide
3,DPN(+.98)LPLK(+42.01)H,DPNLPLKH
9,WN(+.98)FVEL,WNFVEL
12,VHN(+.98)RVSLK,VHNRVSLK
15,SGN(+.98)DLTRQ,SGNDLTRQ
17,NPVN(+.98)LVLDH,NPVNLVLDH


In [11]:
# take all lines that contain an acetylated lysine and make a new df

keep = [r"K\(\+42\.01"]

# .copy() gives an independent DataFrame, so adding a column does not raise SettingWithCopyWarning
K_acet_332 = all_332[all_332.Peptide.str.contains('|'.join(keep))].copy()

# now strip the modification tags; [^)]* matches one "(...)" tag at a time,
# so peptides with several modifications keep the residues between the tags

K_acet_332['stripped_peptide'] = K_acet_332['Peptide'].str.replace(r"\([^)]*\)", "", regex=True).drop_duplicates()

print('Number of lysine acetylation peptides:', len(K_acet_332))

# keep only the stripped peptide column; duplicates were left as NaN above and are dropped here
kacet_332_sp = K_acet_332[["stripped_peptide"]].dropna()

# write to txt file

kacet_332_sp.to_csv('332-T12-combined-k-acet-stripped-peptides.txt', header=False, index=False)

K_acet_332.head()

Number of lysine acetylation peptides: 174


Unnamed: 0,Peptide,stripped_peptide
3,DPN(+.98)LPLK(+42.01)H,DPNLPLKH
23,K(+42.01)LLLPK,KLLLPK
26,NVGK(+42.01)LLLPK,NVGKLLLPK
37,VGK(+42.01)LLLPK,VGKLLLPK
43,TTM(+15.99)DEK(+42.01)TT,TTMDEKTT


In [12]:
# take all lines that contain a methylated arginine and make a new df

keep = [r"R\(\+14\.02"]

# .copy() gives an independent DataFrame, so adding a column does not raise SettingWithCopyWarning
R_meth_332 = all_332[all_332.Peptide.str.contains('|'.join(keep))].copy()

# now strip the modification tags; [^)]* matches one "(...)" tag at a time,
# so peptides with several modifications keep the residues between the tags

R_meth_332['stripped_peptide'] = R_meth_332['Peptide'].str.replace(r"\([^)]*\)", "", regex=True).drop_duplicates()

print('Number of methylated arginine peptides:', len(R_meth_332))

# keep only the stripped peptide column; duplicates were left as NaN above and are dropped here
rmeth_332_sp = R_meth_332[["stripped_peptide"]].dropna()

# write to txt file

rmeth_332_sp.to_csv('332-T12-combined-r-meth-stripped-peptides.txt', header=False, index=False)

R_meth_332.head()

Number of methylated arginine peptides: 130


Unnamed: 0,Peptide,stripped_peptide
68,R(+14.02)PLVVLK,RPLVVLK
78,TTTLR(+14.02)DL,TTTLRDL
82,R(+14.02)KPMPPL,RKPMPPL
93,R(+14.02)DDVLAK,RDDVLAK
241,R(+14.02)TELGKL,RTELGKL
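
To compare the two T12 samples, the counts printed above can be expressed as a share of each sample's combined peptide list. This is a rough sketch that uses only the numbers reported in this notebook; the table layout and column names are illustrative. Note that the totals count every line of the concatenated files, so a peptide found by more than one search is counted more than once.

```python
import pandas as pd

# Counts copied from the printed cell outputs for samples 325 and 332.
counts = pd.DataFrame(
    {
        "total": [10571, 2687],
        "n_deam": [1490, 451],
        "k_acet": [560, 174],
        "r_meth": [722, 130],
    },
    index=["325", "332"],
)

# Percentage of each sample's combined peptide list carrying each modification.
percent = counts[["n_deam", "k_acet", "r_meth"]].div(counts["total"], axis=0) * 100
print(percent.round(1))
```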
