# Practice 1 - calculate peptide mass from sequence.

When you have a peptide sequence, such as “PEPTIDE” or “EDITPEP”, you can calculate the mass of the whole peptide by adding up the masses of the individual amino acids and subtracting the water lost when the peptide bonds form (one water per bond, so n - 1 waters for n residues). For these two sequences, which have the same composition, the result is 799.83 for the average mass and 799.36 for the monoisotopic mass. Most of the time, you need the monoisotopic mass.
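As a quick check of this rule, the sketch below works through PEPTIDE by hand. The values are the monoisotopic masses of the free amino acids, and 18.0105 Da is the water mass used throughout this practice; the names `free_aa_mass`, `WATER` and `peptide_mass` are only illustrative.

```python
# Monoisotopic masses (Da) of the free amino acids that occur in PEPTIDE.
free_aa_mass = {
    'P': 115.063329,
    'E': 147.053159,
    'T': 119.058244,
    'I': 131.094629,
    'D': 133.037509,
}
WATER = 18.0105  # mass lost per peptide bond (value used in this practice)


def peptide_mass(sequence):
    # n residues form n - 1 peptide bonds, so n - 1 waters are lost.
    total = sum(free_aa_mass[aa] for aa in sequence)
    return total - (len(sequence) - 1) * WATER


print(round(peptide_mass('PEPTIDE'), 2))  # 799.36
print(round(peptide_mass('EDITPEP'), 2))  # 799.36
```

Both sequences contain the same residues, so they have the same mass; the order of the residues does not matter for this calculation.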

Here is some simple code adapted from Stack Overflow (with minor modifications) to calculate peptide mass from sequence:

In [1]:
from re import findall as refindall

aminoacid = {
    'I': 'C6H13NO2',
    'L': 'C6H13NO2',
    'K': 'C6H14N2O2',
    'M': 'C5H11NO2S',
    'F': 'C9H11NO2',
    'T': 'C4H9NO3',
    'W': 'C11H12N2O2',
    'V': 'C5H11NO2',
    'R': 'C6H14N4O2',
    'H': 'C6H9N3O2',
    'A': 'C3H7NO2',
    'N': 'C4H8N2O3',
    'D': 'C4H7NO4',
    'C': 'C3H7NO2S',
    'E': 'C5H9NO4',
    'Q': 'C5H10N2O3',
    'G': 'C2H5NO2',
    'P': 'C5H9NO2',
    'S': 'C3H7NO3',
    'Y': 'C9H11NO3'
}

monoisotopic = {
    'S': 31.972072,
    'C': 12.0000,
    'H': 1.007825,
    'O': 15.994915,
    'N': 14.003074
}


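def molecular_weight(molecule):
    # Split a formula such as 'C6H13NO2' into (element, count) pairs;
    # a missing count means 1.
    return sum(
        monoisotopic[atom] * int(num or '1')
        for atom, num in refindall(r'([A-Z][a-z]*)(\d*)', molecule)
    )


def protein_mass(protein):
    # Sum the free amino acid masses, then subtract one water (18.0105 Da)
    # for each of the len(protein) - 1 peptide bonds.
    return sum(molecular_weight(aminoacid[char]) for char in protein) - (len(protein) - 1) * 18.0105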
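
To see how `molecular_weight` reads a chemical formula, it can help to look at what the regular expression returns on its own. The snippet below is a small illustration using leucine's formula from the table above; `parse_formula` and `mono` are hypothetical names, not part of the original code.

```python
from re import findall


def parse_formula(formula):
    # Each match is (element symbol, count string); an empty count means 1.
    return [(atom, int(num or '1'))
            for atom, num in findall(r'([A-Z][a-z]*)(\d*)', formula)]


print(parse_formula('C6H13NO2'))
# [('C', 6), ('H', 13), ('N', 1), ('O', 2)]

# Monoisotopic element masses, same values as in the cell above.
mono = {'C': 12.0000, 'H': 1.007825, 'N': 14.003074, 'O': 15.994915}
mass = sum(mono[atom] * count for atom, count in parse_formula('C6H13NO2'))
print(round(mass, 6))  # 131.094629
```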

In [8]:
for each in aminoacid:
    print("'%s':%s"%(each,molecular_weight(aminoacid[each])))

    
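# Monoisotopic masses of the free amino acids, copied from the printed output (rounded to 6 decimals)
aa_mol = {'I':131.094629,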
'L':131.094629,
'K':146.105528,
'M':149.051051,
'F':165.078979,
'T':119.058244,
'W':204.089878,
'V':117.078979,
'R':174.111676,
'H':155.069477,
'A':89.047679,
'N':132.053493,
'D':133.037509,
'C':121.019751,
'E':147.053159,
'Q':146.069143,
'G':75.032029,
'P':115.063329,
'S':105.042594,
'Y':181.073894}

'I':131.094629
'L':131.094629
'K':146.105528
'M':149.051051
'F':165.078979
'T':119.058244
'W':204.089878
'V':117.07897899999999
'R':174.11167600000002
'H':155.069477
'A':89.047679
'N':132.053493
'D':133.037509
'C':121.019751
'E':147.053159
'Q':146.069143
'G':75.032029
'P':115.063329
'S':105.04259400000001
'Y':181.073894


In [18]:
def protein_mass(protein):
    print(protein)
    return sum(aa_mol[char] for char in protein)  - (len(protein)-1)* 18.0105

In [19]:
protein_mass('PEPTIDE')

PEPTIDE


799.360358

In [37]:
import re
aa_mol = {'I':131.094629,
'L':131.094629,
'K':146.105528,
'M':149.051051,
'F':165.078979,
'T':119.058244,
'W':204.089878,
'V':117.078979,
'R':174.111676,
'H':155.069477,
'A':89.047679,
'N':132.053493,
'D':133.037509,
'C':121.019751,
'E':147.053159,
'Q':146.069143,
'G':75.032029,
'P':115.063329,
'S':105.042594,
'Y':181.073894}
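def protein_mass(protein):
    # Input looks like 'R.RGGM[15.9949]QEMLPGPPGGPGGR.R': drop the flanking
    # residue and dot at each end to keep only the peptide itself.
    peptide_seq = protein[2:-2]
    # Modification mass shifts are written in square brackets, e.g. '[15.9949]'.
    res = re.findall(r'\[.*?\]', peptide_seq)
    # Remove the bracketed modifications, then keep only standard residues
    # (this also drops the lowercase 'n' that marks the N-terminus).
    seq = [i for i in re.sub(r"[\(\[].*?[\)\]]", "", peptide_seq) if i in 'ILKMFTWVRHANDCEQGPSY']
    add_mass = sum(float(i[1:-1]) for i in res)
    return sum(aa_mol[char] for char in seq) - (len(seq) - 1) * 18.0105 + add_mass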

peptide_list=["R.RGGM[15.9949]QEMLPGPPGGPGGR.R","R.RGGM[15.9949]QEMLPGPPGGPGGR.R","K.KGRGSGTSN.K","R.RNVGSPVFPRQ.R","R.RNVGSPVFPRQ.R","R.RGGM[15.9949]QEMLPGPPGGPGGR.R","-.n[42.0106]SRC[57.0215]C[57.0215]PGGPNR.R","P.RASAGGPELDLQGD.R","L.RTQGHNPKC[57.0215]SIMLGDFLFIH.R","P.RGATGGPGDEPLEPA.R","L.RTQGHNPKC[57.0215]SIMLGDFLFIH.R","P.RASAGGPELDLQGD.R","P.RSAAGAHLHVPHAEGGLH.R","V.RSQQPPPISWSVSLSTTSRGEL.R","P.RASAGGPELDLQGD.R","L.RTQGHNPKC[57.0215]SIMLGDFLFIH.R","P.RASAGGPELDLQGD.R","R.RGGM[15.9949]QEMLPGPPGGPGGR.R","P.RASAGGPELDLQGD.R","V.RSQQPPPISWSVSLSTTSRGEL.R","P.RASAGGPELDLQGD.R","P.RSAAGAHLHVPHAEGGLH.R","V.RSQQPPPISWSVSLSTTSRGEL.R","A.KEPPPKK.-","A.RTAGAGSGDE.K","P.RASAGGPELDLQGD.R","L.RTQGHNPKC[57.0215]SIMLGDFLFIH.R","A.KEPPPKK.-","A.RTAGAGSGDE.K","P.RASAGGPELDLQGD.R","S.RFYYLTKGILTC[57.0215]WV.R","-.n[42.0106]SRC[57.0215]C[57.0215]PGGPNR.R","A.KEPPPKK.-","G.RTAANSLR.R","P.RASAGGPELDLQGD.R","P.RSAAGAHLHVPHAEGGLH.R","-.n[42.0106]C[57.0215]LPALPPVSWSRC[57.0215]LV.R","A.KEPPPKK.-"]
for each in peptide_list:
    print(protein_mass(each))


1765.8363040000002
1765.8363040000002
862.4262160000001
1255.6792059999998
1255.6792059999998
1765.8363040000002
1201.5087569999998
1384.6591219999996
2370.1738039999996
1422.674837
2370.1738039999996
1384.6591219999996
1815.925206
2412.241109
1384.6591219999996
2370.1738039999996
1384.6591219999996
1765.8363040000002
1384.6591219999996
2412.241109
1384.6591219999996
1815.925206
2412.241109
822.49673
919.400127
1384.6591219999996
2370.1738039999996
822.49673
919.400127
1384.6591219999996
1818.9499830000002
1201.5087569999998
822.49673
887.4941699999999
1384.6591219999996
1815.925206
1795.9123549999997
822.49673
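
The annotated strings above pack three things into one field: the flanking residues, the peptide itself, and any modification masses in square brackets. The sketch below separates them step by step for one entry from the list; `split_annotated_peptide` is an illustrative name, and the code assumes the `X.PEPTIDE.X` layout seen in `peptide_list`.

```python
import re


def split_annotated_peptide(annotated):
    # Assumes one flanking character plus a dot on each side, e.g. 'R.XXXX.R'.
    core = annotated[2:-2]
    # Collect modification mass shifts written as '[number]'.
    mods = [float(m) for m in re.findall(r'\[(.*?)\]', core)]
    # Drop the bracketed parts, then keep uppercase residue letters only
    # (the lowercase 'n' N-terminus marker is discarded here).
    stripped = re.sub(r'\[.*?\]', '', core)
    residues = ''.join(c for c in stripped if c in 'ACDEFGHIKLMNPQRSTVWY')
    return residues, mods


residues, mods = split_annotated_peptide('-.n[42.0106]SRC[57.0215]C[57.0215]PGGPNR.R')
print(residues)  # SRCCPGGPNR
print(mods)      # [42.0106, 57.0215, 57.0215]
```

The peptide mass is then the unmodified mass of `residues` plus the sum of `mods`.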


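## Task 1: Can you write different code (or a different function) that does exactly the same job, but faster?
Below is some code to help you measure how fast your function runs. In the run shown, protein_mass() was executed 10,000 times (1,000 peptides × 10 repetitions) in about 5 seconds. The random sequences are plain strings with no flanking residues or modifications, so they suit the simple version of protein_mass(), not the one that trims the input with protein[2:-2].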

In [29]:

import timeit, random

random_peptide_sequence_list = [''.join(random.choice('ACDEFGHIKLMNPQRSTVWY') for i in range(100)) for i in range(1000)]

def test_speed():
    for each_peptide in random_peptide_sequence_list: 
        protein_mass(each_peptide)
    
print(timeit.timeit(test_speed, number=10))
print(protein_mass(random_peptide_sequence_list[0]), random_peptide_sequence_list[0])

4.93839897098951
11628.332233999992 HSDFDDTYWKNMEIIQWDISAGARRNVMEPWNCKQCSNGVAMKGEQQVNHPKWGEEATNPEEILIDPTLWYVACQKFDNFLPFLDCSEMVMCLSTVVQTM
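
One possible direction for Task 1, offered as a sketch rather than a reference answer: subtract the water from every amino acid mass ahead of time, so that the per-peptide work is one table lookup per character plus a single addition at the end. `AA_MOL`, `RESIDUE_MASS` and `protein_mass_fast` are names made up for this example, and any speed-up should be confirmed with `timeit` on your own machine.

```python
WATER = 18.0105  # water mass used in this practice

# Monoisotopic masses of the free amino acids (same values as aa_mol).
AA_MOL = {
    'I': 131.094629, 'L': 131.094629, 'K': 146.105528, 'M': 149.051051,
    'F': 165.078979, 'T': 119.058244, 'W': 204.089878, 'V': 117.078979,
    'R': 174.111676, 'H': 155.069477, 'A': 89.047679, 'N': 132.053493,
    'D': 133.037509, 'C': 121.019751, 'E': 147.053159, 'Q': 146.069143,
    'G': 75.032029, 'P': 115.063329, 'S': 105.042594, 'Y': 181.073894,
}

# Residue mass = free amino acid mass minus one water, computed once.
RESIDUE_MASS = {aa: mass - WATER for aa, mass in AA_MOL.items()}


def protein_mass_fast(protein):
    # sum(residue masses) + one water == sum(free masses) - (n - 1) waters
    return sum(map(RESIDUE_MASS.__getitem__, protein)) + WATER


print(round(protein_mass_fast('PEPTIDE'), 4))  # 799.3604
```

Because floating-point addition is not exactly associative, compare the two functions with a small tolerance (for example `math.isclose`) rather than `==`.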


## Task 2: Can you sort the peptide list by the mass of each peptide?

Using the random_peptide_sequence_list above as input, output a peptide list sorted by mass.
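
A minimal sketch for Task 2, assuming plain one-letter sequences like those in `random_peptide_sequence_list`. It uses Python's built-in `sorted` with a `key` function; the short `peptides` list is a stand-in for the random one so the result is easy to check by eye, and `AA_MOL` is an illustrative copy of the mass table.

```python
# Monoisotopic masses of the free amino acids (same values as aa_mol).
AA_MOL = {
    'I': 131.094629, 'L': 131.094629, 'K': 146.105528, 'M': 149.051051,
    'F': 165.078979, 'T': 119.058244, 'W': 204.089878, 'V': 117.078979,
    'R': 174.111676, 'H': 155.069477, 'A': 89.047679, 'N': 132.053493,
    'D': 133.037509, 'C': 121.019751, 'E': 147.053159, 'Q': 146.069143,
    'G': 75.032029, 'P': 115.063329, 'S': 105.042594, 'Y': 181.073894,
}


def protein_mass(protein):
    # Free amino acid masses minus one water per peptide bond.
    return sum(AA_MOL[aa] for aa in protein) - (len(protein) - 1) * 18.0105


peptides = ['PEPTIDE', 'GG', 'WW', 'A']
# sorted() calls protein_mass once per item and orders by the result.
sorted_peptides = sorted(peptides, key=protein_mass)
print(sorted_peptides)  # ['A', 'GG', 'WW', 'PEPTIDE']
```

If the masses are needed afterwards as well, sorting `(mass, peptide)` tuples avoids computing them a second time.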

## Task 3 (optional): Can you plot the mass distribution of the peptide list?
Using the random_peptide_sequence_list above as input, plot a histogram in which the x-axis is the peptide mass, binned every 50 Da, and the y-axis is the number of peptides in the list that fall into each bin.
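
A sketch for Task 3 that covers the binning step with the standard library only. It assumes the peptide masses have already been computed into a list (the `masses` values below are a few numbers taken from the output earlier in this practice), and `bin_masses` is a hypothetical helper. The resulting counts can be handed to a plotting library; with matplotlib, for instance, `plt.bar` over the bin starts or `plt.hist` with explicit bin edges would draw the figure.

```python
from collections import Counter

BIN_WIDTH = 50  # Da


def bin_masses(masses, width=BIN_WIDTH):
    # Map each mass to the lower edge of its bin: 862.4 -> 850, 919.4 -> 900.
    counts = Counter(int(m // width) * width for m in masses)
    # Return the bins in ascending order of their lower edge.
    return dict(sorted(counts.items()))


masses = [862.426216, 822.49673, 919.400127, 887.49417, 1255.679206, 1201.508757]
histogram = bin_masses(masses)

# Text rendering of the histogram: one '#' per peptide in the bin.
for start, count in histogram.items():
    print(f"{start}-{start + BIN_WIDTH} Da: {'#' * count}")
```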