### Manipulation of PEAKS de novo results of Trocas 7 (high water, April 2019) incubation samples LC-MS/MS data using python.

Starting with:

    PEAKS de novo results (.csv) of PTM-optimized sequencing >50% ALC

Goal:

    Files with stripped (no PTMs) peptide lists and
    Columns with #'s of each modification in every sequence
    Column with stripped peptide lengths (# amino acids)
    
### To use:

#### 1. Change the input file name in *IN 4*
#### 2. Use 'find + replace' (Esc + F) to replace the TROCAS # (e.g., 101) for another
#### 2. Change output file name in *IN 6*, *IN 7*, *IN 8*

In [11]:
# LIBRARIES
#import pandas library for working with tabular data
import os
os.getcwd()
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import kde
#import regular expresson (regex)
import re
#check pandas version
pd.__version__

'1.0.5'

In [12]:
cd /home/millieginty/Documents/git-repos/amazon/data/Trocas7-incubations/T7-INC-UWPR_Apr2021_Fus_T00T24/DN50_10lgP15/106_TROCAS7_Fusion_Apr2021_DENOVO_140/

/home/millieginty/Documents/git-repos/amazon/data/Trocas7-incubations/T7-INC-UWPR_Apr2021_Fus_T00T24/DN50_10lgP15/106_TROCAS7_Fusion_Apr2021_DENOVO_140


In [13]:
# read the CSVs into a dataframe using the read_csv function and call 'peaks'

peaks106 = pd.read_csv("/home/millieginty/Documents/git-repos/amazon/data/Trocas7-incubations/T7-INC-UWPR_Apr2021_Fus_T00T24/DN50_10lgP15/106_TROCAS7_Fusion_Apr2021_DENOVO_140/106_TROCAS7_Fusion_Apr2021_DENOVO_140_DN50.csv")

print("# redundant Peaks peptides >50% ALC in combined dataframe:", len(peaks106))

print(peaks106.columns)

# These columns mess things up- get rid of them

del peaks106['Fraction']
del peaks106['Scan']
del peaks106['Source File']
del peaks106['Tag Length']
del peaks106['PTM']
del peaks106['tag (>=0%)']
del peaks106['mode']
del peaks106['local confidence (%)']


columns = ['Peptide', 'ALC', 'length', 'm/z', 'z', 'RT', 'Area',
       'Mass', 'ppm']

peaks106.columns = columns

mean_len = peaks106['length'].mean()
print(mean_len)

# look at the dataframe
peaks106.head()

# redundant Peaks peptides >50% ALC in combined dataframe: 177
Index(['Fraction', 'Scan', 'Source File', 'Peptide', 'Tag Length', 'ALC (%)',
       'length', 'm/z', 'z', 'RT', 'Area', 'Mass', 'ppm', 'PTM',
       'local confidence (%)', 'tag (>=0%)', 'mode'],
      dtype='object')
11.045197740112995


Unnamed: 0,Peptide,ALC,length,m/z,z,RT,Area,Mass,ppm
0,LSSPATLNSR,97,10,523.2864,2,46.95,78200000.0,1044.5564,1.8
1,LSSPATLNSR,96,10,523.2872,2,47.46,78200000.0,1044.5564,3.2
2,LSSPATLNSR,96,10,523.2858,2,47.96,78200000.0,1044.5564,0.7
3,LSSPATLDSR,95,10,523.7776,2,50.85,532000.0,1045.5403,0.4
4,LATVLSPR,95,8,428.767,2,61.79,40100000.0,855.5178,2.0


The peptide column has the masses of modifications (e.g., 57.02 Da for carbamidomethylation of cysteine). We want to make new columns with all that information and make a column with only the 'stripped' peptide sequence that's just amino acids - this we can then align against other sequences, for example.

Modified residues were allowed for:

    fixed carbamodimethylation of cysteine 57.021464 C
    varialbe oxidation of methionine: 15.9949 M
    variable deamidation of asparagine, glumatine: 0.984016 NQ

We'll then write this manipulated dataframe to a new file.

In [14]:
# use a count function to enumerate the # of A's (alanines) in each peptide
peaks106['A'] = peaks106['Peptide'].str.count("A")

# use a count function to enumerate the # of C's (cysteines) in each peptide
peaks106['C'] = peaks106['Peptide'].str.count("C")

# use a count function to enumerate the # of D's (aspartic acids) in each peptide
peaks106['D'] = peaks106['Peptide'].str.count("D")

# use a count function to enumerate the # of E's (glutamic acids) in each peptide
peaks106['E'] = peaks106['Peptide'].str.count("E")

# use a count function to enumerate the # of F's (phenylalanines) in each peptide
peaks106['F'] = peaks106['Peptide'].str.count("F")

# use a count function to enumerate the # of G's (glycines) in each peptide
peaks106['G'] = peaks106['Peptide'].str.count("G")

# use a count function to enumerate the # of H's (histidines) in each peptide
peaks106['H'] = peaks106['Peptide'].str.count("H")

# use a count function to enumerate the # of I's (isoleucines) in each peptide
# in peaks106 output, there will be no isoleucines (they're lumped in with leucines)
peaks106['I'] = peaks106['Peptide'].str.count("I")

# use a count function to enumerate the # of K's (lysines) in each peptide
peaks106['K'] = peaks106['Peptide'].str.count("K")

# use a count function to enumerate the # of L's (leucines) in each peptide
# also these include the isoleucines
peaks106['L'] = peaks106['Peptide'].str.count("L")

# use a count function to enumerate the # of M's (methionines) in each peptide
peaks106['M'] = peaks106['Peptide'].str.count("M")

# use a count function to enumerate the # of N's (asparagines) in each peptide
peaks106['N'] = peaks106['Peptide'].str.count("N")

# use a count function to enumerate the # of P's ([prolines]) in each peptide
peaks106['P'] = peaks106['Peptide'].str.count("P")

# use a count function to enumerate the # of Q's (glutamines) in each peptide
peaks106['Q'] = peaks106['Peptide'].str.count("Q")

# use a count function to enumerate the # of R's (arginines) in each peptide
peaks106['R'] = peaks106['Peptide'].str.count("R")

# use a count function to enumerate the # of S's (serines) in each peptide
peaks106['S'] = peaks106['Peptide'].str.count("S")

# use a count function to enumerate the # of T's (threonines) in each peptide
peaks106['T'] = peaks106['Peptide'].str.count("T")

# use a count function to enumerate the # of V's (valines) in each peptide
peaks106['V'] = peaks106['Peptide'].str.count("V")

# use a count function to enumerate the # of W's (tryptophans) in each peptide
peaks106['W'] = peaks106['Peptide'].str.count("W")

# use a count function to enumerate the # of Y's (tyrosines) in each peptide
peaks106['Y'] = peaks106['Peptide'].str.count("Y")

# use a count function to enumerate the # of carbamidomethylated C's in each peptide
peaks106['c-carb'] = peaks106['Peptide'].str.count("57.02")

# use a count function to enumerate the # of oxidized M's in each peptide
peaks106['m-oxid'] = peaks106['Peptide'].apply(lambda x: x.count('M(+15.99)'))

# use a lamba function to enumerate the # of deamidated N's in each peptide
peaks106['n-deam'] = peaks106['Peptide'].apply(lambda x: x.count('N(+.98)'))

# use a count function to enumerate the # of deamidated Q's in each peptide
peaks106['q-deam'] = peaks106['Peptide'].apply(lambda x: x.count('Q(+.98)'))

# create a column with 'stripped' peptide sequences using strip
peaks106['stripped_peptide'] = peaks106['Peptide'].str.replace(r"\(.*\)","")

# add a column with the stripped peptide length (number of AAs)
peaks106['stripped_length'] = peaks106['stripped_peptide'].apply(len)

# total the number of modifications in sequence
peaks106['ptm-total'] = peaks106['c-carb'] + peaks106['m-oxid'] + peaks106['n-deam'] + peaks106['q-deam']

# calculate NAAF numerator for each peptide k
peaks106['NAAF_num.'] = peaks106['Area'] / peaks106['stripped_length']

# write modified dataframe to new txt file, same name + 'stripped'
peaks106.to_csv("/home/millieginty/Documents/git-repos/amazon/data/Trocas7-incubations/processed/PeaksDN/106_CV_T00_GF_DN/106A_CV_T00_GF_DN50.csv")

# check out the results
peaks106.head()

Unnamed: 0,Peptide,ALC,length,m/z,z,RT,Area,Mass,ppm,A,...,W,Y,c-carb,m-oxid,n-deam,q-deam,stripped_peptide,stripped_length,ptm-total,NAAF_num.
0,LSSPATLNSR,97,10,523.2864,2,46.95,78200000.0,1044.5564,1.8,1,...,0,0,0,0,0,0,LSSPATLNSR,10,0,7820000.0
1,LSSPATLNSR,96,10,523.2872,2,47.46,78200000.0,1044.5564,3.2,1,...,0,0,0,0,0,0,LSSPATLNSR,10,0,7820000.0
2,LSSPATLNSR,96,10,523.2858,2,47.96,78200000.0,1044.5564,0.7,1,...,0,0,0,0,0,0,LSSPATLNSR,10,0,7820000.0
3,LSSPATLDSR,95,10,523.7776,2,50.85,532000.0,1045.5403,0.4,1,...,0,0,0,0,0,0,LSSPATLDSR,10,0,53200.0
4,LATVLSPR,95,8,428.767,2,61.79,40100000.0,855.5178,2.0,1,...,0,0,0,0,0,0,LATVLSPR,8,0,5012500.0


In [15]:
# keep only stripped peptide I/L and NAAF
dn_106 = peaks106[['stripped_peptide', 'Area', 'NAAF_num.']]

dn_106.set_index('stripped_peptide')

# write modified dataframe to new txt file
dn_106.to_csv("/home/millieginty/Documents/git-repos/amazon/data/Trocas7-incubations/processed/I-L_NAAFs/106A_CV_T00_GF_DN50_ILnaafs.csv")

dn_106.head()

Unnamed: 0,stripped_peptide,Area,NAAF_num.
0,LSSPATLNSR,78200000.0,7820000.0
1,LSSPATLNSR,78200000.0,7820000.0
2,LSSPATLNSR,78200000.0,7820000.0
3,LSSPATLDSR,532000.0,53200.0
4,LATVLSPR,40100000.0,5012500.0


In [16]:
# made a new dataframe that contains the suMN of certain columns 
# in the stripped peptide dataframe above (for >50% ALC)

index = ['sample total']

data = {'A': peaks106['A'].sum(),
        'C': peaks106['C'].sum(),
        'D': peaks106['D'].sum(),
        'E': peaks106['E'].sum(),
        'F': peaks106['F'].sum(),
        'G': peaks106['G'].sum(),
        'H': peaks106['H'].sum(),
        'I': peaks106['I'].sum(),
        'K': peaks106['K'].sum(),
        'L': peaks106['L'].sum(),
        'M': peaks106['M'].sum(),
        'N': peaks106['N'].sum(),
        'P': peaks106['P'].sum(),
        'Q': peaks106['Q'].sum(),
        'R': peaks106['R'].sum(),
        'S': peaks106['S'].sum(),
        'T': peaks106['T'].sum(),
        'V': peaks106['V'].sum(),
        'W': peaks106['W'].sum(),
        'Y': peaks106['Y'].sum(),
        'c-carb': peaks106['c-carb'].sum(),
        'm-oxid': peaks106['m-oxid'].sum(),
        'n-deam': peaks106['n-deam'].sum(),
        'q-deam': peaks106['q-deam'].sum(),
        'Total area': peaks106['Area'].sum(),
        'Total length': peaks106['stripped_length'].sum()
       }

totalpeaks106 = pd.DataFrame(data, columns=['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', \
                                            'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y', \
                                            'c-carb', 'm-oxid', 'n-deam', 'q-deam', \
                                            'Total area', 'Total length'], index=index)

# calculate percentage of C's with carb (should be 1.0)
totalpeaks106['% C w/ carb'] = totalpeaks106['c-carb'] / totalpeaks106['C'] 

# calculate percentage of M's that are oxidized
totalpeaks106['% M w/ oxid'] = totalpeaks106['m-oxid'] / totalpeaks106['M'] 

# calculate percentage of N's that are deamidated
totalpeaks106['% N w/ deam'] = totalpeaks106['n-deam'] / totalpeaks106['N'] 

# calculate percentage of N's that are deamidated
totalpeaks106['% Q w/ deam'] = totalpeaks106['q-deam'] / totalpeaks106['Q'] 

# calculate NAAF denominator for all peptides in dataset i
totalpeaks106['NAAF denom.'] = totalpeaks106['Total area'] / totalpeaks106['Total length']

# write modified dataframe to new txt file
totalpeaks106.to_csv("/home/millieginty/Documents/git-repos/amazon/data/Trocas7-incubations/processed/PeaksDN/106_CV_T00_GF_DN/106A_CV_T00_GF_DN50_totals.csv")

totalpeaks106.head()

Unnamed: 0,A,C,D,E,F,G,H,I,K,L,...,m-oxid,n-deam,q-deam,Total area,Total length,% C w/ carb,% M w/ oxid,% N w/ deam,% Q w/ deam,NAAF denom.
sample total,149,57,34,51,50,117,103,0,136,229,...,16,5,1,1928283000.0,1830,1.0,0.172043,0.064935,0.033333,1053707.0


In [17]:
# use the calculated NAAF factor (in totalpeaks dataframe, above) to caluclate the NAAF 
# NAAF: normalized normalized area abundance factor

# don't have to worry here about DECOY hits messing with Area totals
# but we would with Comet results

NAAF50 = 1.053707e+06

# use NAAF >50% ALC to get NAAF factor
peaks106['NAAF factor'] = (peaks106['NAAF_num.'])/NAAF50

# make a dataframe that contains only what we need: sequences, AAs, PTMN
peaksAAPTM_106 = peaks106[['stripped_peptide', 'NAAF factor', 'A', 'C', 'D', 'E', 'F', 'G', 'H', 'K', 'I', 'L', \
                                'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y', 'c-carb', 'm-oxid', \
                                'n-deam', 'q-deam']].copy()

# multiply the NAAF50 factor by the AA total to normalize its abundance by peak area and peptide length

peaksAAPTM_106['A-NAAF50'] = peaksAAPTM_106['A'] * peaks106['NAAF factor']
peaksAAPTM_106['C-NAAF50'] = peaksAAPTM_106['C'] * peaks106['NAAF factor']
peaksAAPTM_106['D-NAAF50'] = peaksAAPTM_106['D'] * peaks106['NAAF factor']
peaksAAPTM_106['E-NAAF50'] = peaksAAPTM_106['E'] * peaks106['NAAF factor']
peaksAAPTM_106['F-NAAF50'] = peaksAAPTM_106['F'] * peaks106['NAAF factor']
peaksAAPTM_106['G-NAAF50'] = peaksAAPTM_106['G'] * peaks106['NAAF factor']
peaksAAPTM_106['H-NAAF50'] = peaksAAPTM_106['H'] * peaks106['NAAF factor']
peaksAAPTM_106['I-NAAF50'] = peaksAAPTM_106['I'] * peaks106['NAAF factor']
peaksAAPTM_106['K-NAAF50'] = peaksAAPTM_106['K'] * peaks106['NAAF factor']
peaksAAPTM_106['L-NAAF50'] = peaksAAPTM_106['L'] * peaks106['NAAF factor']
peaksAAPTM_106['M-NAAF50'] = peaksAAPTM_106['M'] * peaks106['NAAF factor']
peaksAAPTM_106['N-NAAF50'] = peaksAAPTM_106['N'] * peaks106['NAAF factor']
peaksAAPTM_106['P-NAAF50'] = peaksAAPTM_106['P'] * peaks106['NAAF factor']
peaksAAPTM_106['Q-NAAF50'] = peaksAAPTM_106['Q'] * peaks106['NAAF factor']
peaksAAPTM_106['R-NAAF50'] = peaksAAPTM_106['R'] * peaks106['NAAF factor']
peaksAAPTM_106['S-NAAF50'] = peaksAAPTM_106['S'] * peaks106['NAAF factor']
peaksAAPTM_106['T-NAAF50'] = peaksAAPTM_106['T'] * peaks106['NAAF factor']
peaksAAPTM_106['V-NAAF50'] = peaksAAPTM_106['V'] * peaks106['NAAF factor']
peaksAAPTM_106['W-NAAF50'] = peaksAAPTM_106['W'] * peaks106['NAAF factor']
peaksAAPTM_106['Y-NAAF50'] = peaksAAPTM_106['Y'] * peaks106['NAAF factor']

# multiply the NAAF50 factor by the PTMN normalize its abundance by peak area and peptide length

peaksAAPTM_106['ccarb-NAAF50'] = peaksAAPTM_106['c-carb'] * peaksAAPTM_106['NAAF factor']
peaksAAPTM_106['moxid-NAAF50'] = peaksAAPTM_106['m-oxid'] * peaksAAPTM_106['NAAF factor']
peaksAAPTM_106['ndeam-NAAF50'] = peaksAAPTM_106['n-deam'] * peaksAAPTM_106['NAAF factor']
peaksAAPTM_106['qdeam-NAAF50'] = peaksAAPTM_106['q-deam'] * peaksAAPTM_106['NAAF factor']


# write the dataframe to a new csv
peaksAAPTM_106.to_csv("/home/millieginty/Documents/git-repos/amazon/data/Trocas7-incubations/processed/PeaksDN/106_CV_T00_GF_DN/106A_CV_T00_GF_DN50_naaf.csv")

peaksAAPTM_106.head()

Unnamed: 0,stripped_peptide,NAAF factor,A,C,D,E,F,G,H,K,...,R-NAAF50,S-NAAF50,T-NAAF50,V-NAAF50,W-NAAF50,Y-NAAF50,ccarb-NAAF50,moxid-NAAF50,ndeam-NAAF50,qdeam-NAAF50
0,LSSPATLNSR,7.421418,1,0,0,0,0,0,0,0,...,7.421418,22.264254,7.421418,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,LSSPATLNSR,7.421418,1,0,0,0,0,0,0,0,...,7.421418,22.264254,7.421418,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,LSSPATLNSR,7.421418,1,0,0,0,0,0,0,0,...,7.421418,22.264254,7.421418,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,LSSPATLDSR,0.050488,1,0,1,0,0,0,0,0,...,0.050488,0.151465,0.050488,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,LATVLSPR,4.757015,1,0,0,0,0,0,0,0,...,4.757015,4.757015,4.757015,4.757015,0.0,0.0,0.0,0.0,0.0,0.0


In [18]:
# made a dataframe that's the sum of NAAF corrected AAs and PTMN

index = ['sample total']

data = {'NAAF': peaksAAPTM_106['NAAF factor'].sum(),
        'A': peaksAAPTM_106['A-NAAF50'].sum(),
        'C': peaksAAPTM_106['C-NAAF50'].sum(),
        'D': peaksAAPTM_106['D-NAAF50'].sum(),
        'E': peaksAAPTM_106['E-NAAF50'].sum(),
        'F': peaksAAPTM_106['F-NAAF50'].sum(),
        'G': peaksAAPTM_106['G-NAAF50'].sum(),
        'H': peaksAAPTM_106['H-NAAF50'].sum(),
        'I': peaksAAPTM_106['I-NAAF50'].sum(),
        'K': peaksAAPTM_106['K-NAAF50'].sum(),
        'L': peaksAAPTM_106['L-NAAF50'].sum(),
        'M': peaksAAPTM_106['M-NAAF50'].sum(),
        'N': peaksAAPTM_106['N-NAAF50'].sum(),
        'P': peaksAAPTM_106['P-NAAF50'].sum(),
        'Q': peaksAAPTM_106['Q-NAAF50'].sum(),
        'R': peaksAAPTM_106['R-NAAF50'].sum(),
        'S': peaksAAPTM_106['S-NAAF50'].sum(),
        'T': peaksAAPTM_106['T-NAAF50'].sum(),
        'V': peaksAAPTM_106['V-NAAF50'].sum(),
        'W': peaksAAPTM_106['W-NAAF50'].sum(),
        'Y': peaksAAPTM_106['Y-NAAF50'].sum(),
        'c-carb': peaksAAPTM_106['ccarb-NAAF50'].sum(),
        'm-oxid': peaksAAPTM_106['moxid-NAAF50'].sum(),
        'n-deam': peaksAAPTM_106['ndeam-NAAF50'].sum(),
        'q-deam': peaksAAPTM_106['qdeam-NAAF50'].sum(),
       }

totalpeaks50_NAAF = pd.DataFrame(data, columns=['NAAF', 'A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', \
                                           'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', \
                                           'W', 'Y', 'c-carb', 'm-oxid', 'n-deam', 'q-deam'], index=index)

# calculate NAAF-corrected percentage of C's with carb (should be 1.0)
totalpeaks50_NAAF['% C w/ carb.'] = totalpeaks50_NAAF['c-carb'] / totalpeaks50_NAAF['C'] 

# calculate NAAF-corrected percentage of M's that are oxidized
totalpeaks50_NAAF['% M w/ oxid'] = totalpeaks50_NAAF['m-oxid'] / totalpeaks50_NAAF['M'] 

# calculate NAAF-corrected percentage of N's that are deamidated
totalpeaks50_NAAF['% N w/ deam'] = totalpeaks50_NAAF['n-deam'] / totalpeaks50_NAAF['N'] 

# calculate NAAF-corrected percentage of N's that are deamidated
totalpeaks50_NAAF['% Q w/ deam'] = totalpeaks50_NAAF['q-deam'] / totalpeaks50_NAAF['Q'] 

# calculate NAAF summed numerator over denominator (in above cell) for all peptides in dataset i: a check
totalpeaks50_NAAF['NAAF check'] = totalpeaks50_NAAF['NAAF'] / 1.053707e+06

# write modified dataframe to new txt file, same name + totals
totalpeaks50_NAAF.to_csv("/home/millieginty/Documents/git-repos/amazon/data/Trocas7-incubations/processed/PeaksDN/106_CV_T00_GF_DN/106A_CV_T00_GF_DN50_naaf_totals.csv")

totalpeaks50_NAAF.head()

Unnamed: 0,NAAF,A,C,D,E,F,G,H,I,K,...,Y,c-carb,m-oxid,n-deam,q-deam,% C w/ carb.,% M w/ oxid,% N w/ deam,% Q w/ deam,NAAF check
sample total,214.592065,154.452393,7.725057,3.155193,5.926626,5.714729,14.021258,9.237067,0.0,10.07927,...,4.752753,7.725057,0.637027,0.326376,0.080668,1.0,0.104168,0.007591,0.024358,0.000204


## Export stripped peptides >50% ALC

In [19]:
##### keep only stripped peptide column 
pep50 = peaks106[["stripped_peptide"]]

# write altered dataframe to new txt file
# used header and index parameters to get rid of 'Peptide' header and the indexing

pep50.to_csv("/home/millieginty/Documents/git-repos/amazon/data/Trocas7-incubations/processed/PeaksDN/106_CV_T00_GF_DN/106A_CV_T00_GF_DN50_stripped_peptides.txt", header=False, index=False)

# made the text file into a FASTA 
!awk '{print ">"NR"\n"$0}' /home/millieginty/Documents/git-repos/amazon/data/Trocas7-incubations/processed/PeaksDN/106_CV_T00_GF_DN/106A_CV_T00_GF_DN50_stripped_peptides.txt > \
/home/millieginty/Documents/git-repos/amazon/data/Trocas7-incubations/processed/PeaksDN/106_CV_T00_GF_DN/106A_CV_T00_GF_DN50_stripped_peptides.fas

# removing redundancy
peaks50dedup = pd.DataFrame.drop_duplicates(pep50)

print("# redundant stripped Peaks peptides >50% ALC", len(pep50))
print("# nonredundant stripped Peaks peptides >50% ALC", len(peaks50dedup))
print("average peptide length Peaks peptides >50% ALC", peaks106['stripped_length'].mean())

# count all unique peptide (modified peptides included)
# keep only peptide column >50% ALC
pep50m = peaks106[["Peptide"]]

# deduplicate
pep50mdedup = pd.DataFrame.drop_duplicates(pep50m)

print("# redundant Peaks peptides >50% ALC", len(pep50m))
print("# nonredundant Peaks peptides", len(pep50mdedup))

# check
pep50.head()

# redundant stripped Peaks peptides >50% ALC 177
# nonredundant stripped Peaks peptides >50% ALC 126
average peptide length Peaks peptides >50% ALC 10.338983050847459
# redundant Peaks peptides >50% ALC 177
# nonredundant Peaks peptides 126


Unnamed: 0,stripped_peptide
0,LSSPATLNSR
1,LSSPATLNSR
2,LSSPATLNSR
3,LSSPATLDSR
4,LATVLSPR
