# **Lesson 1 - Measuring Mass**

In this first lesson, we'll learn about how a mass spectrometer measures mass. The focus of this lesson, and all lessons, is on data exploration and not the chemistry or engineering of the physical mass spectrometer instrument.

## **Assumptions**

It is assumed that the reader is already familiar with basic molecular biology and introductory chemistry, as well as terms like polypeptides and amino acids. If needed, see these links for more information:
- [Amino Acids](https://en.wikipedia.org/wiki/Amino_acid)
- [Peptides](https://en.wikipedia.org/wiki/Peptide)

## **Goals**

The goal of this lesson is to help you become familiar with the elements of a mass spectrum and where they come from. At the end of this lesson, you should be able to:

 - Calculate the mass of an amino acid sequence, or peptide.
 - Calculate the mass-to-charge ratio of an amino acid sequence, or peptide, given different charge states.
 - Identify isotopic envelopes in a mass spectrum and infer charge from the envelope.

## **Context**

We will be analyzing a mass spectrum which contains a single peptide. We will be looking at the spectrum using Python programming and several popular and helpful libraries created by the proteomics community.

## **Using this Tutorial**

This tutorial is designed to be interactive, and you are encouraged to change the code and explore. To do this, you'll need to save a copy of this so that you have editing permissions. Use `File->Save a copy in Drive` to make an editable copy for yourself. Colab notebooks consist of text cells (like this one) and code cells. You interact with the notebook by executing (running) the code cells by clicking the "play button" in each cell. You can also run all cells at once by using `Runtime->Run all`.

---

## **Part 1. Installation and Setup**

Before diving into the practical aspects of measuring mass, let's prepare our environment by installing the necessary Python packages and defining the functions we'll need. For those unfamiliar with Google Colab, it's worth noting that the platform runs on a virtual machine (VM), so the default Python installation might not have all the packages that we'll need. To install additional packages on the VM, you can use the `!pip` command. These packages will enable us to analyze and visualize the data effectively. We'll be working with real data, so we'll also download the data file into the Colab environment using `gdown`.

In this notebook, some code cells have been 'hidden' for brevity, like the set-up one below. You can recognize these because they just have a play button and a small text prompt `Show code`. You may want to look at this later in the lesson, but for now you can probably just click through. The first task is to establish the basic ideas behind **mass**.

In [None]:
# @title Run this cell to set up the coding environment, including installing and loading necessary Python packages and loading in the data files.
%%capture
!pip install pyteomics==4.6.1
!pip install gdown
!pip install plotly==5.18.0

import pyteomics
from pyteomics import mzml, auxiliary
import gdown
import numpy as np
import plotly.tools as tls
import plotly.graph_objects as go

# Here we download the mass spectrometry file to the VM
!gdown 1DQ_-_fB8HxmA9AivVDQQEnXMT942a4d9
mzml_path = '/content/Ova_200uM_70_30_FullMS1.mzML'

In [None]:
# @title Run this cell to declare a function that plots a mass spectrum.

def plot(spectrum, x_min = None, x_max = None, title = None):
    X = spectrum['m/z array']
    Y = spectrum['intensity array']
    Y_max = max(Y)
    Y_percentage = [(y/Y_max)*100 for y in Y]

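    # Default to the full observed m/z range when no limits are given
    if x_min is None:
        x_min = spectrum.get('lowest observed m/z', min(X))
    if x_max is None:
        x_max = spectrum.get('highest observed m/z', max(X))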

    trace = go.Scatter(
        x = X,
        y = Y_percentage,
        mode = 'lines',
        name = 'Spectrum',
        line=dict(color='black')
    )

    layout = go.Layout(
        title = title,
        xaxis = dict(
            title = 'm/z',
            range = [x_min, x_max],
            linecolor='black',
            mirror=True
        ),
        yaxis = dict(
            title = 'Intensity (%)',
            range = [0, 105],
            linecolor='black',
            mirror=True
        ),
        plot_bgcolor='white',
        paper_bgcolor='white'
    )

    fig = go.Figure(data=[trace], layout=layout)
    fig.show()

## **Part 2. Calculating the Mass of a Peptide**

A **mass spectrometer**, simply put, measures mass. Although this may seem obvious, it's really important to understand. You can think of a mass spectrometer as a really big (and expensive) kitchen scale: Ultimately it just measures the mass of whatever you put in it.

If you put in a single peptide, it will measure that peptide.

If you give it a mix of different molecules, it will measure the mass of each one and tell you how much of each is there.

For this tutorial, we're using a controlled experiment where we know exactly what's in our sample. The sample for this lesson is a single **polypeptide**: serine-isoleucine-isoleucine-asparagine-phenylalanine-glutamic acid-lysine-leucine. In proteomics, we usually use single-letter codes for amino acids. So, our peptide is `SIINFEKL`.

Knowing the amino acids in our peptide, we can work out its mass. Calculating the mass of peptides is a fundamental part of computational mass spectrometry - every analysis tool does these calculations.

[This website](http://www.matrixscience.com/help/aa_help.html) is a great reference for all amino acids, their 1-letter codes, and their masses. For our calculations, we use the [**monoisotopic mass**](https://en.wikipedia.org/wiki/Monoisotopic_mass).

In the next code cell, we've made a dictionary that matches each amino acid's 1-letter code with its monoisotopic mass. Run the cell to access the dictionary later on in the lesson.

In [None]:
aa_mass = {'A': 71.037114, 'R':156.101111 , 'N': 114.042927,
           'D': 115.026943, 'C': 103.009185, 'E': 129.042593,
           'Q' : 128.058578, 'G': 57.021464, 'H': 137.058912,
           'I': 113.084064, 'L': 113.084064, 'K': 128.094963,
           'M' : 131.040485, 'F':  147.068414, 'P':  97.052764,
           'S': 87.032028, 'T': 101.047679, 'U': 150.95363,
           'W': 186.079313, 'Y': 163.06332, 'V': 99.068414}

**Please Note:** the weights in our dictionary are for individual amino acids as they appear inside a peptide (bonded together), not as free amino acids. To get the mass of the whole peptide, we add up the masses of each amino acid and then add the mass of a hydrogen (H) at the **N-terminus** (beginning) and the mass of an oxygen and a hydrogen (OH) at the **C-terminus** (end), as shown in the code cell below.

In [None]:
peptide = 'SIINFEKL'
neutral_mass = 0
mass_Hydrogen = 1.007825  # monoisotopic mass of hydrogen (1H)
mass_Oxygen = 15.994915   # monoisotopic mass of oxygen (16O)

for aa in peptide:
  neutral_mass+=aa_mass[aa]
neutral_mass += mass_Hydrogen #N-terminal hydrogen
neutral_mass += (mass_Hydrogen + mass_Oxygen) #C-terminal OH
neutral_mass = round(neutral_mass,2)

print (neutral_mass)

962.54


In the mass spectrometry and proteomics community, this value is known as the **neutral mass**. We measure it in **Daltons (Da), or unified atomic mass units (u)**. Usually, we'd say something like "This peptide weighs 962.54". The accuracy of a mass spectrometer can change based on its design, but most modern machines are accurate out to 2-4 decimals.

(To keep things simple in this tutorial, we'll use two decimal precision.)

So, `SIINFEKL` weighs 962.54. Now let's take a peek at our data file.
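
The calculation above can also be wrapped in a small reusable function. The following is only a sketch: the names `RESIDUE_MASS`, `WATER_MASS`, and `peptide_neutral_mass` are made up for illustration, and the dictionary holds just the residues found in `SIINFEKL` (copied from `aa_mass`) so that the cell runs on its own.

```python
# Monoisotopic residue masses for the amino acids in SIINFEKL
# (same values as the aa_mass dictionary above).
RESIDUE_MASS = {
    'S': 87.032028, 'I': 113.084064, 'N': 114.042927, 'F': 147.068414,
    'E': 129.042593, 'K': 128.094963, 'L': 113.084064,
}
# The N-terminal H plus the C-terminal OH add up to one water: 2 H + 1 O.
WATER_MASS = 2 * 1.007825 + 15.994915


def peptide_neutral_mass(sequence, residue_mass=RESIDUE_MASS):
    """Return the monoisotopic neutral mass of a peptide sequence."""
    unknown = sorted(set(sequence) - set(residue_mass))
    if unknown:
        raise ValueError(f"No mass available for residue(s): {unknown}")
    return sum(residue_mass[aa] for aa in sequence) + WATER_MASS


print(round(peptide_neutral_mass('SIINFEKL'), 2))  # 962.54
```

To try other peptides, pass the full `aa_mass` dictionary as the second argument.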

## **Part 3. Analyzing Mass Spectrometry Data**

The native format for mass spectrometry data is unique to each instrument vendor. As a community standard, we use a free and open format called [**mzML**](https://www.psidev.info/mzML). Although it's a complex XML-style format, the parts we're interested in for the moment are two matched arrays: an *m/z array* (ratio of mass to charge) and an *intensity array* (the number of molecules measured at the ratio). In the code below, we pull a single spectrum out of a file, dig through the m/z array, and find our peptide.

In [None]:
# Open the mzML file. The reader gets its own name so that it does not
# shadow the `mzml` module imported during setup.
reader = pyteomics.mzml.MzML(mzml_path)
scan_num = '1'
my_id = 'controllerType=0 controllerNumber=1 scan=' + scan_num
spectrum = reader.get_by_id(my_id)
np.set_printoptions(threshold=np.inf)  # print full arrays instead of truncating them


# The spectrum holds a lot of data. If you want to look through the m/z array
# for our peptide (neutral mass 962.54), uncomment the print statement below.

# print(spectrum)
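
Rather than reading the printed arrays by eye, you could search them in code. Here is a minimal sketch, assuming the two arrays are matched element by element as described above. The function name `most_intense_peak`, its `tolerance` parameter, and the toy values are all invented for illustration and are not taken from the lesson's data file.

```python
def most_intense_peak(mz_values, intensities, target_mz, tolerance=0.1):
    """Return (m/z, intensity) of the tallest point within
    target_mz +/- tolerance, or None if the window is empty."""
    in_window = [
        (mz, intensity)
        for mz, intensity in zip(mz_values, intensities)
        if abs(mz - target_mz) <= tolerance
    ]
    if not in_window:
        return None
    return max(in_window, key=lambda pair: pair[1])


# Made-up example data, NOT values from the lesson's mzML file.
toy_mz = [300.10, 300.20, 300.30, 450.00]
toy_intensity = [10.0, 80.0, 30.0, 55.0]

print(most_intense_peak(toy_mz, toy_intensity, target_mz=300.15))
print(most_intense_peak(toy_mz, toy_intensity, target_mz=400.0))
```

With the real spectrum, the same call should work by passing `spectrum['m/z array']` and `spectrum['intensity array']` as the first two arguments.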

Well, I love looking at arrays as much as the next person, but it is much easier to understand this data if we plot it out and examine it visually. Below is some code to plot this out.

A few things in the image are surprising, and we'll address them one by one.

In [None]:
# Plot the spectrum

# This function is defined above in Part 1 - take a look at the code up
#   there to see what it is doing
plot(spectrum, title = peptide + " (neutral mass: "+ str(neutral_mass) + ")")

Right off the bat, we notice that there are two main peaks (*note: all peaks are given an `Intensity %` relative to the tallest peak*). To get a more detailed look at the peaks, drag over an area of the plot with your mouse to zoom in, and double-click on the plot to zoom back out. The little peaks of low intensity can be dismissed as **noise** (small errors in the mass spectrometer) and/or contaminants. However, there are clearly two main things being measured. Unfortunately, neither of them is at m/z 962.54. Let's find out why.

To understand this, we need to recognize how peptides are introduced into the mass spectrometer. To get a peptide into the mass spectrometer, we **ionize** it. This means we add a positive charge to the peptide through protonation (attaching a proton). Sometimes the peptide takes up one proton, sometimes two. There's a lot of chemistry here that I'm glossing over, but for this lesson, it's enough to know that this peptide in our spectrum acquires either one or two protons. A proton has mass, about 1 amu, which explains why one of our peaks is at 963.5 and not 962.5 as calculated above.

**But why are the two main peaks so far apart?**

Earlier, we mentioned that a mass spectrometer measures mass. However, in reality, it measures mass-to-charge, abbreviated as **m/z**. A proton is charged, so the number of protons affects the m/z value. For our peptide, the m/z calculations for the two charge states are:

**Singly charged peptide:**

$\text{m/z} = \frac{\text{neutral_mass }+ \space \text{proton_mass}}{1}$

**Doubly charged peptide:**

$\text{m/z} = \frac{\text{neutral_mass } + \space (2 \space \times \space \text{proton_mass})}{2}$

The calculations for this are in the code cell below and match what we see in the spectrum above.

In [None]:
proton_mass = 1.007276466879 # a proton is not the same as a hydrogen, they differ by an electron
mass_charge_1 = (neutral_mass + proton_mass)/1
mass_charge_2 = (neutral_mass + 2 * proton_mass)/2

print("mass_charge_1 = ",round(mass_charge_1,2))
print("mass_charge_2 = ",round(mass_charge_2,2))

mass_charge_1 =  963.55
mass_charge_2 =  482.28
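
The two formulas above are special cases of one general expression, m/z = (neutral mass + charge × proton mass) / charge. A small sketch of that generalization follows. The function names `mass_to_charge` and `neutral_mass_from_mz` are hypothetical, and only positive charges gained by protonation are considered.

```python
PROTON_MASS = 1.007276466879  # same value as proton_mass above


def mass_to_charge(neutral_mass, charge):
    """m/z of a molecule carrying `charge` extra protons (charge >= 1)."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return (neutral_mass + charge * PROTON_MASS) / charge


def neutral_mass_from_mz(mz, charge):
    """Invert mass_to_charge: recover the neutral mass from an m/z value."""
    return mz * charge - charge * PROTON_MASS


for z in (1, 2, 3):
    print(z, round(mass_to_charge(962.54, z), 2))
```

For charges 1 and 2 this reproduces the values printed above (963.55 and 482.28). The inverse function is useful when you observe a peak, know its charge, and want the neutral mass back.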


Now zoom into the graph near `mass_charge_1`. (Drag over an area with your mouse to zoom in on that area. Double-click to zoom back out).

In [None]:
# Run this cell to recreate the plot here
plot(spectrum, title = peptide + " (neutral mass: "+ str(neutral_mass) + ")")

They match! **But what else do you notice?**

When we take a closer look at our singly charged peptide in the graph above, we see something interesting. Besides our expected peak at 963.55, there are also pronounced peaks at 964.55, 965.55, and 966.55. What could these represent?

These peaks make up what is called the **isotopic envelope**, and they exist because of the presence of carbon **isotopes**, i.e. <sup>13</sup>C. As you may remember from chemistry, 99% of carbon is <sup>12</sup>C. It has six protons and six neutrons for an atomic mass of 12. However, about 1% of carbon is [<sup>13</sup>C](https://en.wikipedia.org/wiki/Carbon-13), which has an *extra* neutron. This version of carbon weighs one amu more. Therefore, if a peptide species had one <sup>13</sup>C, it would show up as heavier in our mass spectrum. For our peptide, `SIINFEKL`, it would show up at 964.55.

Our peptide of interest, `SIINFEKL`, has a chemical formula of C<sub>45</sub>H<sub>74</sub>N<sub>10</sub>O<sub>13</sub>. That's 45 carbon atoms, and odds are that some of them are <sup>13</sup>C. Because each carbon independently has about a 1% chance of being heavy, we can calculate the expected number of heavy carbons in a peptide. Looking at our spectrum, we see that the most abundant peak is at 963.55. This corresponds to our original calculated m/z, so it represents peptides with zero heavy carbons. The peak at 964.55 m/z corresponds to versions of the peptide that have one heavy carbon out of the 45 total carbons. There are also peaks at higher m/z for peptides with two or three heavy carbons.

You can play around with this, if you want, using an interactive website [here](https://www.envipat.eawag.ch/index.php). Isotopes are really important and very helpful for computational algorithms in mass spectrometry data analysis. Every molecule that contains carbon will have an isotopic envelope. The relative intensity of the monoisotopic peak and the heavier peaks depends on the number of carbons: the more carbons a peptide has, the more likely it is to contain at least one <sup>13</sup>C, so the heavier peaks grow relative to the monoisotopic peak.
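
As a rough illustration of the probability argument, the number of heavy carbons in a peptide can be modeled with a binomial distribution. This is a simplified sketch, not how dedicated isotope tools work. It assumes a <sup>13</sup>C abundance of about 1.07% and ignores the heavy isotopes of hydrogen, nitrogen, and oxygen, so the real envelope will differ somewhat. The name `heavy_carbon_probabilities` is made up for this example.

```python
from math import comb

P_C13 = 0.0107  # assumed natural abundance of 13C (about 1.07%)


def heavy_carbon_probabilities(n_carbons, max_heavy=3, p=P_C13):
    """Binomial probability of 0..max_heavy 13C atoms among n_carbons."""
    return [
        comb(n_carbons, k) * p**k * (1 - p)**(n_carbons - k)
        for k in range(max_heavy + 1)
    ]


probs = heavy_carbon_probabilities(45)  # SIINFEKL has 45 carbons
for k, prob in enumerate(probs):
    # Relative height is scaled to the zero-13C (monoisotopic) peak
    relative = 100 * prob / probs[0]
    print(f"{k} heavy carbon(s): P = {prob:.3f}, relative = {relative:.0f}%")

print("expected heavy carbons:", round(45 * P_C13, 2))
```

Try changing 45 to a larger number of carbons and watch the zero-heavy-carbon probability fall.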

To summarize, in the first peak of our zoomed-in graph, there are zero <sup>13</sup>C atoms. In the second peak there is one <sup>13</sup>C atom, the third has 2, etc. creating our isotopic envelope.

Now let's compare it to the isotopic envelope of `mass_charge_2` around 482.28. What do you notice about the space between each isotopic peak?

In [None]:
# Run this cell to recreate the plot here
plot(spectrum, title = peptide + " (neutral mass: "+ str(neutral_mass) + ")")

Looking at this next zoomed in section of our spectrum, we also see the characteristic set of peaks that is an isotopic envelope. However, if you look carefully at the spectrum above, you'll notice that there is something different than what we saw for the singly charged peptide. In this one, the spacing between peaks is about 0.5 m/z, where our peaks for the singly charged peptide were spaced about 1.0 m/z apart. This is because the doubly charged peptide has a charge $z=2$. Therefore, the mass of an extra neutron (~1 amu) shows up as: $\frac{\text{1 mass unit}}{\text{2 charges}}$, or 0.5 m/z.
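
The spacing rule can be turned around to estimate the charge from an envelope: charge ≈ (mass of one extra neutron) / (spacing between neighboring isotopic peaks). The sketch below assumes you have already picked out the m/z values of the peaks in one envelope. The name `infer_charge` is hypothetical, and the <sup>13</sup>C to <sup>12</sup>C mass difference of about 1.00335 Da is a stated assumption.

```python
from statistics import median

# Assumed mass difference between 13C and 12C, in Da (about 1.00335)
C13_C12_DIFF = 1.00335


def infer_charge(peak_mzs):
    """Estimate charge from the m/z values of consecutive isotopic peaks."""
    peaks = sorted(peak_mzs)
    if len(peaks) < 2:
        raise ValueError("need at least two isotopic peaks")
    # Spacing between each pair of neighboring peaks
    spacings = [b - a for a, b in zip(peaks, peaks[1:])]
    # The median keeps one badly picked peak from skewing the estimate
    return round(C13_C12_DIFF / median(spacings))


print(infer_charge([963.55, 964.55, 965.55]))  # 1
print(infer_charge([482.28, 482.78, 483.28]))  # 2
```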

## **Conclusion**

That's the end of this lesson! The main concepts that we covered in this lesson include calculating the mass of peptides, calculating the m/z of peptides at various charge states, analyzing a mass spectrum and identifying the isotopic envelope, and using the isotopic envelope to infer the charge state for a set of molecules. Additionally, you learned how to read in a file with Pyteomics, extract the data you need, and graph it.

In this lesson, we worked with a mass spectrum containing only a single peptide. In [Lesson 2](https://colab.research.google.com/drive/15cwLXSNBbVSGe1tdFB-VikMSgGXdmkKp#scrollTo=UqRp86vJohVU), you will learn how to read a mass spectrum with a more complex mixture of peptides.

## **Lesson 1 Terms**

* **mass spectrometer**: scientific instrument that measures the ratio of mass and charge of molecules
* **polypeptide**: two or more amino acids bonded together; may be a whole protein or pieces of a protein
* **monoisotopic mass**: mass calculated using the exact mass of the most abundant isotope of each element
* **N-terminus**: the start of a protein or peptide
* **C-terminus**: the end of a protein or peptide
* **neutral mass**: calculated mass of an uncharged peptide
* **Dalton (Da), or unified atomic mass units (u)**: standard unit of mass used in mass spectrometry; one Da or u is approximately the mass of a single proton or neutron
* **mzML**: an XML-like file format used by the community to store mass spectrometry data
* **noise**: small errors in the mass spectrometer resulting in a measurement peak in the spectrum
* **ionization**: the process in which molecules acquire a charge (whether positive or negative); for these introductory lessons, we mean a *positive* charge unless otherwise stated
* **m/z**: the mass-to-charge ratio of a molecule; $\text{m/z = }\frac{\text{neutral_mass }+ \space \text{(charge * proton_mass)}}{\text{charge}}$
* **intensity**: the abundance of molecules at a particular m/z
* **isotope**: a different version of an element; has the same number of protons but a different number of neutrons, and therefore a different mass, than other versions of the same element; <sup>12</sup>C and <sup>13</sup>C are isotopes of the element carbon (C)
* **isotopic envelope**: a set of peaks representing the same molecule but with varying amounts of the <sup>13</sup>C isotope