# BIOC003 - Mass Spectrometry

*Prof. Kostas Thalassinos*

In this notebook we will use Python to revisit some of the concepts we covered in the tandem mass spectrometry lecture.

We will use Python to calculate the sequence and *m/z* of the *a*, *b* and *y* fragment ions for a given peptide sequence.

The Python concepts covered will include:
- dictionaries (used to store the residue masses)
- for loops


## Simulating peptide fragmentation

For this practical let us assume the sequence of our peptide is:

    AITGVMEK

We will store this sequence as a Python variable named *peptide*:

In [None]:
peptide = "AITGVMEK"

Before we predict the different fragments let us calculate the mass of the intact peptide.

To do so we need to know the residue mass of each amino acid. A very useful way of storing many different bits of information for each amino acid is using a Python dictionary. Python dictionaries are a data structure containing a collection of key-value pairs. 

Run the following cell to load the dictionary into the variable *amino_acid_properties*.


In [None]:
amino_acid_properties = {
"A": {"name": "Alanine",	"single_letter": "A",	"three_letter": "Ala",	"elemental": "C3H5NO",	"mass_mono": 71.03711,	"mass_avg": 71.0779},
"C": {"name": "Cysteine",	"single_letter": "C",	"three_letter": "Cys",	"elemental": "C3H5NOS",	"mass_mono": 103.00919,	"mass_avg": 103.1429},
"D": {"name": "Aspartic acid",	"single_letter": "D",	"three_letter": "Asp",	"elemental": "C4H5NO3",	"mass_mono": 115.02694,	"mass_avg": 115.0874},
"E": {"name": "Glutamic acid",	"single_letter": "E",	"three_letter": "Glu",	"elemental": "C5H7NO3",	"mass_mono": 129.04259,	"mass_avg": 129.1140},
"F": {"name": "Phenylalanine",	"single_letter": "F",	"three_letter": "Phe",	"elemental": "C9H9NO",	"mass_mono": 147.06841,	"mass_avg": 147.1739},
"G": {"name": "Glycine",	"single_letter": "G",	"three_letter": "Gly",	"elemental": "C2H3NO",	"mass_mono": 57.02146,	"mass_avg": 57.0513},
"H": {"name": "Histidine",	"single_letter": "H",	"three_letter": "His",	"elemental": "C6H7N3O",	"mass_mono": 137.05891,	"mass_avg": 137.1393},
"I": {"name": "Isoleucine",	"single_letter": "I",	"three_letter": "Ile",	"elemental": "C6H11NO",	"mass_mono": 113.08406,	"mass_avg": 113.1576},
"K": {"name": "Lysine",	"single_letter": "K",	"three_letter": "Lys",	"elemental": "C6H12N2O",	"mass_mono": 128.09496,	"mass_avg": 128.1723},
"L": {"name": "Leucine",	"single_letter": "L",	"three_letter": "Leu",	"elemental": "C6H11NO",	"mass_mono": 113.08406,	"mass_avg": 113.1576},
"M": {"name": "Methionine",	"single_letter": "M",	"three_letter": "Met",	"elemental": "C5H9NOS",	"mass_mono": 131.04049,	"mass_avg": 131.1961},
"N": {"name": "Asparagine",	"single_letter": "N",	"three_letter": "Asn",	"elemental": "C4H6N2O2",	"mass_mono": 114.04293,	"mass_avg": 114.1026},
"P": {"name": "Proline",	"single_letter": "P",	"three_letter": "Pro",	"elemental": "C5H7NO",	"mass_mono": 97.05276,	"mass_avg": 97.1152},
"Q": {"name": "Glutamine",	"single_letter": "Q",	"three_letter": "Gln",	"elemental": "C5H8N2O2",	"mass_mono": 128.05858,	"mass_avg": 128.1307},
"R": {"name": "Arginine",	"single_letter": "R",	"three_letter": "Arg",	"elemental": "C6H12N4O",	"mass_mono": 156.10111,	"mass_avg": 156.1857},
"S": {"name": "Serine",	"single_letter": "S",	"three_letter": "Ser",	"elemental": "C3H5NO2",	"mass_mono": 87.03203,	"mass_avg": 87.0773},
"V": {"name": "Valine",	"single_letter": "V",	"three_letter": "Val",	"elemental": "C5H9NO",	"mass_mono": 99.06841,	"mass_avg": 99.1311},
"W": {"name": "Tryptophan",	"single_letter": "W",	"three_letter": "Trp",	"elemental": "C11H10N2O",	"mass_mono": 186.07931,	"mass_avg": 186.2099},
"Y": {"name": "Tyrosine",	"single_letter": "Y",	"three_letter": "Tyr",	"elemental": "C9H9NO2",	"mass_mono": 163.06333,	"mass_avg": 163.1760},
"T": {"name": "Threonine",	"single_letter": "T",	"three_letter": "Thr",	"elemental": "C4H7NO2",	"mass_mono": 101.04768,	"mass_avg": 101.1051}
}

In this dictionary each key is a one-letter amino acid code, and the value stored under each key holds several properties of that amino acid. For each amino acid we can access the following information:
- full name
- single letter code
- three letter description
- elemental composition
- monoisotopic mass
- average mass


To see what specific values are stored for the amino acid Glycine, we need to look it up in the dictionary using the *key* corresponding to Glycine, which in our case is 'G'. So we access the 'G' entry in our dictionary named *amino_acid_properties* by typing *amino_acid_properties['G']*:

In [None]:
print(amino_acid_properties['G'])

From this you can see that what is returned is another dictionary which now has 6 key-value pairs.
The 6 keys are
- name : the name of the amino acid
- single_letter: the single letter description of the amino acid
- three_letter: the three letter code 
- elemental: the elemental formula
- mass_mono: the monoisotopic mass
- mass_avg: the average mass

You can see that the three-letter code is Gly, the monoisotopic mass is 57.02146, and so on.
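
As a small aside, the inner dictionary can also be explored with a loop. The sketch below uses a cut-down, one-entry copy of the dictionary (named *example_properties* here purely for illustration, so that the cell runs on its own) and prints every key-value pair stored for glycine.

```python
# Cut-down copy of the dictionary with a single entry so this cell runs on its own.
# The name example_properties is only used for this illustration.
example_properties = {
    "G": {"name": "Glycine", "single_letter": "G", "three_letter": "Gly",
          "mass_mono": 57.02146, "mass_avg": 57.0513},
}

glycine = example_properties["G"]  # the inner dictionary for glycine

# .items() yields each key together with its value
for key, value in glycine.items():
    print(key, ":", value)

# len() counts the key-value pairs in the inner dictionary
number_of_properties = len(glycine)
print(number_of_properties)
```

The same loop should work on any entry of the full *amino_acid_properties* dictionary once it has been loaded.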


To access only one of the properties of an amino acid, e.g. its monoisotopic mass, type *amino_acid_properties['G']['mass_mono']*. To access the three-letter code you would type *amino_acid_properties['G']['three_letter']*.
You can store this information in a variable for later use, like so:

        glycine_monoisotopic_mass = amino_acid_properties['G']['mass_mono']


As you can see the above dictionary makes it easy to store, and access, many different properties for each amino acid.

In [None]:
glycine_monoisotopic_mass = amino_acid_properties['G']['mass_mono']
print(glycine_monoisotopic_mass)

**TASK:** Using what you have just learned can you assign the monoisotopic mass of Serine to a variable and print it?

Great! Now that we have the information for each amino acid and know how to access it, let's move on and calculate the monoisotopic mass of our peptide. Remember that the monoisotopic masses stored in our dictionary are *residue* masses, i.e. the mass of the free amino acid minus one water (an H and an OH group), which is lost when the peptide bond forms.

$ResidueMass = AminoAcidMass - H_2O$

In order to calculate the mass of the intact peptide we therefore need to add the mass for each residue and then add the mass of H and mass of OH. 

$PeptideMass = \sum Mass\_of\_residues + H_2O$

Use the following values for the mass of H and O:

$H = 1.0078$

$O = 15.9949$

To do so we will assign the mass of H and O to new variables called *mass_H* and *mass_O*. We will then go through each amino acid in our peptide's sequence and use it as a key into our dictionary, so that we can access the monoisotopic mass of each of these amino acids. Python has a very easy way of *looping* through strings using the following:

        for amino_acid in peptide:
        
so that at each iteration of this for loop the *amino_acid* variable will hold the value of one amino acid. To see this in practice, run the code in the cell below.
            



In [None]:
for amino_acid in peptide:
    print(amino_acid)

As we saw above, to access the monoisotopic mass of glycine we typed

        amino_acid_properties['G']['mass_mono']

so now we can use the amino acid letter at each iteration of the loop as input in our dictionary to access its monoisotopic mass 
        
        amino_acid_properties[amino_acid]['mass_mono']
        
Finally, we create a new variable called *peptide_mass* to hold the total mass of the peptide.

We bring it all together in the code below. Have a look and make sure you understand the logic.

In [None]:
mass_H = 1.0078
mass_O = 15.9949

peptide_mass = 0 #variable to hold the mass of the peptide

for amino_acid in peptide: # go through each amino acid in the peptide sequence, starting from the first (N-terminal) residue
    peptide_mass += amino_acid_properties[amino_acid]['mass_mono'] # keep adding the mass of each residue

peptide_mass = peptide_mass + (2*mass_H) + mass_O # add the mass of H2O

print(peptide_mass)
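
The same calculation can be written more compactly with Python's built-in `sum()`. The sketch below is one possible alternative, not a replacement for the loop above. To keep it runnable on its own it uses a small stand-in dictionary, *residue_mass_mono* (a hypothetical name), holding only the monoisotopic residue masses of the eight residues in our peptide.

```python
# Monoisotopic residue masses for the residues in AITGVMEK only
# (values copied from the amino_acid_properties dictionary; the name
# residue_mass_mono is an assumption made for this illustration).
residue_mass_mono = {
    "A": 71.03711, "I": 113.08406, "T": 101.04768, "G": 57.02146,
    "V": 99.06841, "M": 131.04049, "E": 129.04259, "K": 128.09496,
}

mass_H = 1.0078
mass_O = 15.9949
example_peptide = "AITGVMEK"

# sum() adds up one residue mass per letter in the sequence
sum_of_residues = sum(residue_mass_mono[aa] for aa in example_peptide)

# add one water (2 x H + O) for the free N- and C-termini
example_peptide_mass = sum_of_residues + 2 * mass_H + mass_O

print(round(example_peptide_mass, 4))
```

With the values above the result should be close to 847.447, which is a useful check against the output of the loop version.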

**TASK** In the code above we calculated the monoisotopic mass of the peptide. What changes would you make to the above code to calculate the average mass of the peptide?

## Calculating fragment ion masses

OK, let's now move on to calculating the mass of the *a*, *b* and *y* fragment ions. Remember that *a* and *b* ions retain the N-terminal part of the peptide while *y* ions retain the C-terminal part of the peptide. In the code below we will simulate fragmentation along the peptide backbone. There are many different ways of doing the following in Python, but here we will cover one which makes use of Python's substrings.

For our peptide AITGVMEK the b ion series will be

        - b1: A
        - b2: AI
        - b3: AIT
        - b4: AITG

and so on. 

So the b1 ion is the first amino acid, the b2 ion is the first and second amino acids, and so on. Put another way, counting from 1, the b1 ion spans position 1 of the sequence, b2 spans positions 1 to 2, and so on. In Python you can access a substring of a string using this notation:

        substring = string[start:end]
        
One thing to note is that Python counts positions from 0, not 1, and that the character at the *end* index is not included in the substring. Also, if an end index is not given, it is assumed to be the end of the string. For example, to get the substring from the 4th amino acid to the end of our peptide you need to execute the following:

         peptide[3:]
         
or, to get from the beginning of the peptide up to and including the 3rd amino acid, we can use

        peptide[:3]


In [None]:
peptide[:3]

**TASK** Try extracting different parts of the peptide here. A good description of how to use substrings can be found here
https://www.freecodecamp.org/news/how-to-substring-a-string-in-python/
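
A few more slices are sketched below to illustrate the two rules above (counting starts at 0, and the end index is excluded). The variable names are chosen for this illustration only.

```python
example_peptide = "AITGVMEK"

first_three = example_peptide[:3]   # positions 0, 1, 2
from_fourth = example_peptide[3:]   # position 3 to the end of the string
middle = example_peptide[2:5]       # positions 2, 3, 4 (position 5 is excluded)

print(first_three, from_fourth, middle)

# Two slices that share the same index always rebuild the full sequence
rebuilt = first_three + from_fourth
print(rebuilt == example_peptide)
```

Because the end index is excluded, `example_peptide[:3]` and `example_peptide[3:]` never overlap, which is convenient when splitting a peptide into an N-terminal and a C-terminal part.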

OK, to extract a substring as we saw in the above example we need to have an index. We previously used the below code to loop through the amino acids in the sequence

        for amino_acid in peptide:
            peptide_mass += amino_acid_properties[amino_acid]['mass_mono']
            
There is a way of also getting a counter (which we will use as an index) while looping through the sequence and for this we will use *enumerate()*. The above example can be written as

        for counter, amino_acid in enumerate(peptide):
            print(counter, amino_acid)
             
Run this code in the below cell to see the output


In [None]:
for counter, amino_acid in enumerate(peptide):
    print(counter, amino_acid)

We still get each amino acid stored in the variable *amino_acid*, but we also get the value of the counter stored in the variable *counter*.

For more on enumerate see https://realpython.com/python-enumerate/
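
By default the counter starts at 0. As an optional aside, `enumerate()` also accepts a `start` argument, so the counter can begin at 1 and match the way residues are usually numbered. The sketch below is an alternative to, not a replacement for, the approach used in the rest of this notebook.

```python
example_peptide = "AITGVMEK"

# start=1 makes the counter begin at 1 instead of the default 0
for position, amino_acid in enumerate(example_peptide, start=1):
    print(position, amino_acid)

# Collect the counter values to show the range that is covered
positions = [position for position, _ in enumerate(example_peptide, start=1)]
print(positions)
```

Note that a counter starting at 1 no longer lines up with Python's 0-based string positions, so it is best kept for labelling rather than for slicing.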



We can now use the following code to list the name and sequence of all the *b* fragment ions

In [None]:
for counter, amino_acid in enumerate(peptide):
    ion_number = counter
    ion_sequence = peptide[:counter]
    
    print('b', ion_number, ion_sequence)

We can tidy up the above code so that it does not display the first line, which is empty, and also prettify the name of the ion.

In [None]:
for counter, amino_acid in enumerate(peptide):
    if counter > 0: # only print ions if the counter is larger than 0
        ion_number = 'b' + str(counter) # convert the counter (an integer) to a string so that it can be concatenated with the letter b
        ion_sequence = peptide[:counter]
    
        print(ion_number, ion_sequence)

What remains now is to calculate the mass of each fragment ion. Remember from the lectures that the mass of *b* ions is:

$\sum Mass\_of\_residues\_retained + N\_term$

where in our case $N\_term$ is H.

We can create a function that will take as input the fragment sequence and return the mass (more precisely the +1 *m/z* value for this fragment).

In [None]:
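def calculate_b_ion_mass(fragment_sequence):
    """Return the +1 m/z of a b ion: the sum of its residue masses plus H for the N-terminus."""

    mass_N_term = 1.0078 # mass of H, the N-terminal group
    fragment_mass = 0 # running total of the residue masses

    for residue in fragment_sequence: # go through each residue in the fragment
        fragment_mass += amino_acid_properties[residue]['mass_mono']

    fragment_mass = fragment_mass + mass_N_term # add the N-terminal H

    return fragment_mass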
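
As a quick sanity check of this logic, the sketch below computes the b1 and b2 ions of our peptide by hand. It is self-contained, so it uses a two-entry stand-in dictionary (*example_properties*) and a compact copy of the function (*example_b_ion_mass*); both names are assumptions made for this illustration.

```python
# Minimal stand-ins so that this cell runs on its own: only the two residues
# needed for the check, with masses copied from the full dictionary.
example_properties = {
    "A": {"mass_mono": 71.03711},
    "I": {"mass_mono": 113.08406},
}

def example_b_ion_mass(fragment_sequence, properties=example_properties):
    """Sum the residue masses and add H (1.0078) for the N-terminus."""
    mass_N_term = 1.0078
    residue_total = sum(properties[aa]["mass_mono"] for aa in fragment_sequence)
    return residue_total + mass_N_term

b1 = example_b_ion_mass("A")
b2 = example_b_ion_mass("AI")

# Consecutive b ions differ by the mass of the residue that was added
difference = b2 - b1

print(round(b1, 4), round(b2, 4), round(difference, 4))
```

The difference between b2 and b1 should match the residue mass of isoleucine (113.08406). This is why the spacing between neighbouring peaks in an ion series can be used to read off the sequence.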

Now that we have that function we can combine everything together to get the *b* fragments with their names and *m/z* values.

In [None]:
for counter, amino_acid in enumerate(peptide):
    if counter > 0: # only print ions if the counter is larger than 0
        ion_number = 'b' + str(counter) # convert the counter (an integer) to a string so that it can be concatenated with the letter b
        ion_sequence = peptide[:counter]
        ion_mass = calculate_b_ion_mass(ion_sequence)

        print(ion_number, ion_sequence, ion_mass)
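
The printed masses can show many decimal places. One way to tidy the output, sketched below, is to collect the ions in a list and print them with an f-string that fixes the number of decimals. This is only one possible variation: it loops with `range()` instead of `enumerate()`, and the names *residue_mass_mono* and *b_ions* are assumptions made so that the cell runs on its own.

```python
# Monoisotopic residue masses for the residues in AITGVMEK only
# (values copied from the full dictionary).
residue_mass_mono = {
    "A": 71.03711, "I": 113.08406, "T": 101.04768, "G": 57.02146,
    "V": 99.06841, "M": 131.04049, "E": 129.04259, "K": 128.09496,
}

example_peptide = "AITGVMEK"
mass_N_term = 1.0078  # H, as used for the b ions above

b_ions = []  # one (name, sequence, mass) tuple per b ion

# lengths 1 to 7: the full-length peptide is not a fragment
for length in range(1, len(example_peptide)):
    ion_sequence = example_peptide[:length]
    ion_mass = sum(residue_mass_mono[aa] for aa in ion_sequence) + mass_N_term
    b_ions.append((f"b{length}", ion_sequence, ion_mass))

for ion_name, ion_sequence, ion_mass in b_ions:
    # <3 and <8 pad the text columns; .4f keeps four decimal places
    print(f"{ion_name:<3} {ion_sequence:<8} {ion_mass:.4f}")
```

Keeping the results in a list, rather than only printing them, makes it easier to reuse the values later, for example to compare them with another ion series.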
    
        

**TASK** Given all that you know now can you calculate the sequence and the mass of the *a* and *y* ions?

I hope you enjoyed this practical. As mentioned, there are many different ways of doing the same thing so go ahead and experiment. Have fun!