# **Lesson 4 - Identifying Peptide Sequences with MS/MS**

In this lesson, we'll learn how to identify the amino acid sequence of peptides in a complex sample. The focus of this lesson, and all lessons, is on data exploration and not the chemistry or engineering of the mass spectrometer.

## **Assumptions**

It is assumed that the reader has completed the prior lessons and is familiar with the basics of biology and introductory chemistry. From the previous lessons, it is critical that you understand the limitations of using m/z and chromatography alone for identification.

## **Goals**
At the end of this lesson, you should be able to:
- Understand precursor isolation, fragmentation, and measurement in MS2.
- Generate a b/y ion ladder.
- Understand the sequence of steps in an LC-MS/MS experiment.

## **Context**

In [Lesson 3](https://colab.research.google.com/drive/1WvWy2ULWr_9FqyzhbcugisC9qs2J5fzr#scrollTo=MyPfCYIXGjiq), we learned that the more information we have, the easier it is to identify a peptide. We saw that when we put a protein digest in an LC-MS experiment, different peptides have different retention times. This lets us distinguish between multiple peptides even when they have the same m/z value.

Although LC-MS may help us distinguish between different peptides, it still *doesn't provide enough information* to identify the amino acid sequence of the peptide. In this lesson, we'll learn how to do this with **Tandem Mass Spectrometry** (also known as **MS/MS** and **MS2**).

## **Using this Tutorial**

This tutorial is designed to be interactive, and you are encouraged to change the code and explore. To do this, you'll need to save a copy of this notebook so that you have editing permissions. Use `File->Save a copy in Drive` to make an editable copy for yourself. Colab notebooks consist of text cells (like this one) and code cells. You interact with the notebook by executing (running) the code cells by clicking the "play button" in each cell. You can also run all cells at once by using `Runtime->Run all`.

---


## **Part 1. Installation and Setup**

Before diving into the practical aspects of tandem mass spectrometry, let's prepare our environment by installing the necessary Python packages and defining the functions we need. These packages will enable us to analyze and visualize the data effectively. To apply the concepts we've learned, we'll be working with real data, so we'll be loading the data files into the Colab environment using `gdown`.

In this notebook, some code cells have been 'hidden' for brevity, like the next few below. You can recognize these because they just have a play button and a small text prompt `Show code`. In addition to the setup code, these include several functions that we will use throughout the lesson - some functions from previous lessons and some plotting code. You may want to look at these later in the lesson, but for now you can probably just click through. The first task is to establish the basic ideas behind **fragmentation**.

In [None]:
# @title Run this cell to set up the coding environment, including installing and loading necessary Python packages and loading in the data files.
%%capture
!pip install gdown
!pip install pyteomics==4.6.1
!pip install plotly==5.18.0
!pip install pandas
!pip install spectrum_utils==0.4.2

import spectrum_utils.plot as sup
import spectrum_utils.spectrum as sus
import pyteomics
from pyteomics import mzml, auxiliary
import gdown
import plotly.io as pio
import plotly.tools as tls
import plotly.graph_objects as go
import pandas as pd
import random


# Download the mzml file that has the MS2 spectra
!gdown 1V91Dspzak-tGO6YtYKjJKT50UWncyqn8
mzml_path = '/content/04-17-23_CA_Tryp_HCD_10min.mzML'

In [None]:
# @title Run this cell to create our amino acid dictionary (from Lesson 1).
aa_mass = {'A': 71.037114, 'R':156.101111 , 'N': 114.042927,
           'D': 115.026943, 'C': 103.009185, 'E': 129.042593,
           'Q' : 128.058578, 'G': 57.021464, 'H': 137.058912,
           'I': 113.084064, 'L': 113.084064, 'K': 128.094963,
           'M' : 131.040485, 'F':  147.068414, 'P':  97.052764,
           'S': 87.032028, 'T': 101.047679, 'U': 150.95363,
           'W': 186.079313, 'Y': 163.06332, 'V': 99.068414}

In [None]:
# @title Run this cell to declare a function that creates a b-/y-ion ladder for a peptide.

# This will make a b/y ion ladder for any given peptide and put it in a dataframe
def make_ion_ladder(peptide, aa_mass):
    b_ions = {}
    y_ions = {}

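    # Note: these constants are rounded; the ladder values shown later in the
    # lesson were computed with exactly these numbers.
    mass_Hydrogen = 1.0078    # monoisotopic hydrogen (more precisely 1.007825)
    mass_Oxygen = 15.994915   # monoisotopic oxygen
    proton_mass = 1.007       # proton (more precisely 1.007276)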

    '''
    Note: In the following calculations, if you look closely, you will see
    some additions of protons that are different from what we did in
    Lessons 1 & 2. This is because instead of enzymatically digesting a protein,
    we are violently fragmenting the peptides milliseconds before measuring them.
    The physics is a little more complicated, so don't worry too much about it.

    If you really want to know the chemistry/physics behind this, you can read
    about it in this paper: https://cse.sc.edu/~rose/790B/papers/dancik.pdf
    '''
    # Generate b-ions
    b_mass_current = 0
    b_ion = ''
    for aa in peptide:
        b_ion += aa
        if b_ion != peptide:  # skip the full-length peptide (not a fragment)
            b_mass_current += aa_mass[aa]
            b_ions[b_ion] = b_mass_current + proton_mass  # mass of the charge on fragment

    # Generate y-ions
    y_mass_current = mass_Hydrogen + mass_Oxygen  # adds the C-terminal OH
    y_mass_current += proton_mass  # adds the H on the fragment's N-terminal side (approximated with proton_mass)
    y_ion = ''
    for aa in peptide[::-1]:
        y_ion += aa
        if y_ion[::-1] != peptide:  # skip the full-length peptide (not a fragment)
            y_mass_current += aa_mass[aa]
            y_ions[y_ion[::-1]] = y_mass_current + proton_mass  # mass of the charge on fragment

    # Populate dataframe
    data = {
        'b#': [b+1 for b in range(len(peptide)-1)],
        'b_ion_m/z': [b_ions[b_key] for b_key in b_ions.keys()],
        'b_ion_sequence': [b_key for b_key in b_ions.keys()],
        'y_ion_sequence': [y_key for y_key in y_ions.keys()][::-1],
        'y_ion_m/z': [y_ions[y_key] for y_key in y_ions.keys()][::-1],
        'y#': [len(peptide)-i-1 for i in range(len(peptide)-1)]
    }

    # Format dataframe
    df = pd.DataFrame(data)
    # df = df.style.set_properties(
    #     subset=['b_ion_sequence'],
    #     **{'text-align': 'left'}
    # ).format({
    #     'b_ion_m/z': '{:,.2f}',
    #     'y_ion_m/z': '{:,.2f}'
    # }).set_table_styles([{
    #     'selector': 'thead th',
    #     'props': [('vertical-align', 'bottom'), ('text-align', 'left')]
    # }, {
    #     'selector': 'th.index_name',  # targeting the index name specifically
    #     'props': [('vertical-align', 'bottom')]
    # }])

    # print(df)
    return df

In [None]:
# @title Run this cell to declare a function that gets an MS2 spectrum object.

def get_MS2_object(mzml_path, scan, peptide = None):
    su_spectrum = None
    with pyteomics.mzml.read(mzml_path) as spectra:
        for spectrum in spectra:
            scanNumber = int(spectrum['id'].split('=')[-1])
            if scanNumber == scan:
                # This finds the corresponding values in the .mzml file to create our MS2
                spectrum_id = spectrum['id']
                mz = spectrum['m/z array']
                intensity = spectrum['intensity array']
                retention_time = spectrum['scanList']['scan'][0]['scan start time']
                precursor_mz = spectrum['precursorList']['precursor'][0]['isolationWindow']['isolation window target m/z']
                precursor_charge = int(spectrum['precursorList']['precursor'][0]['selectedIonList']['selectedIon'][0]['charge state'])

                su_spectrum = sus.MsmsSpectrum(spectrum_id, precursor_mz, precursor_charge, mz, intensity, retention_time=retention_time)

                # Process the spectrum
                su_spectrum = (su_spectrum.filter_intensity(0.05, 100)
                               .remove_precursor_peak(fragment_tol_mass=0.5, fragment_tol_mode='Da')
                               .scale_intensity('root'))
                break
    # Formatting
    if su_spectrum:
        fragment_tol_mass = 0.5
        fragment_tol_mode = 'Da'  ## for some reason, if I use 'ppm' it doesn't work

        # If given the peptide, spec_utils can annotate the peaks
        if peptide:
          su_spectrum = su_spectrum.annotate_proforma(peptide, fragment_tol_mass, fragment_tol_mode, ion_types='by', max_ion_charge=2)
    return su_spectrum

In [None]:
# @title Run this cell to declare a function that plots an MS2 spectrum.

def plot_MS2(ms2_spectrum):
    ax = sup.spectrum(ms2_spectrum)
    plotly_fig = tls.mpl_to_plotly(ax.figure)
    plotly_fig['layout']['plot_bgcolor'] = 'white'
    plotly_fig['layout']['xaxis']['showline'] = True
    plotly_fig['layout']['xaxis']['linecolor'] = 'black'
    plotly_fig['layout']['xaxis']['linewidth'] = 2
    plotly_fig['layout']['yaxis']['linecolor'] = 'black'
    plotly_fig['layout']['yaxis']['linewidth'] = 2
    plotly_fig.show()

## **Part 2. Fragmentation: Why and How?**

In [Lesson 2](https://colab.research.google.com/drive/15cwLXSNBbVSGe1tdFB-VikMSgGXdmkKp?usp=sharing), we learned how to break down a protein into several peptides by digesting it with an enzyme called trypsin. In this lesson, we will learn how to break down peptides into even smaller pieces. Instead of using an enzyme, we'll split the peptides with energy in a process known as **fragmentation**. The peptide before fragmentation is called a **precursor** and the pieces resulting from fragmentation are called **fragments**.

*Why would we want to fragment a peptide?*

Two different peptides can have the same total mass but different sequences. Fragmentation helps distinguish between such peptides because fragments are sequence-dependent.
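
To make this concrete, here is a minimal sketch comparing `VLDALDSIK` with a rearranged version, `LVDALDSIK`. The second peptide is hypothetical and chosen only for illustration; it is not from our dataset. The residue masses are copied from the lesson's `aa_mass` dictionary, and the helper names are our own. Both peptides have the same total mass, but their smallest front pieces do not.

```python
# Residue masses for the amino acids in this example (from the lesson's aa_mass)
residue_mass = {'V': 99.068414, 'L': 113.084064, 'D': 115.026943,
                'A': 71.037114, 'S': 87.032028, 'I': 113.084064,
                'K': 128.094963}

def total_residue_mass(peptide):
    """Sum of the residue masses of every amino acid in the peptide."""
    return sum(residue_mass[aa] for aa in peptide)

def prefix_masses(peptide):
    """Cumulative residue mass of each front piece (no charge added)."""
    masses, running = [], 0.0
    for aa in peptide[:-1]:
        running += residue_mass[aa]
        masses.append(round(running, 6))
    return masses

peptide_a = 'VLDALDSIK'
peptide_b = 'LVDALDSIK'  # hypothetical rearrangement, for illustration only

same_total = abs(total_residue_mass(peptide_a) - total_residue_mass(peptide_b)) < 1e-6
prefixes_a = prefix_masses(peptide_a)
prefixes_b = prefix_masses(peptide_b)

print('Same total mass:', same_total)
print('First front piece:', prefixes_a[0], 'vs', prefixes_b[0])
```

A measurement of the intact peptides alone could not separate these two, but a measurement of their pieces could.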

*Fragmentation happens inside the mass spectrometer.*

When we introduce a peptide to a mass spectrometer, we don't input just one instance of it. Our experiment with the protein carbonic anhydrase probably contained millions of copies of the same protein. This corresponds to millions of copies of each peptide.
  
Inside the MS instrument, there is a special chamber that fragments peptides by applying energy, causing each peptide to break somewhere along its sequence. Almost always the break occurs between two adjacent amino acids. As with other lessons, we are avoiding the gory chemistry details about what this *energy* actually is. Today, we're just going to talk about a generic energy that can break a peptide into two pieces.

Where is the break? Between which two amino acids? That's a great question. Although there are some more and less favorable pairs, our awesome chemistry/physics friends have created a mass spectrometry instrument which is fairly unbiased. This means that for any given peptide instance, the fragmentation could occur between any two amino acids.

Let's think about `VLDALDSIK`, a peptide from **Lesson 2**. We might get a fragmentation between amino acids 3 and 4. This would produce two fragment ions: the front-half ion `VLD` and the back-half ion `ALDSIK`. Alternatively, fragmentation might occur between amino acids 6 and 7. This would produce a front-half ion `VLDALD` and a back-half ion `SIK`.
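
The two breaks described above are just two of the possibilities. As a quick sketch (the function name `fragment_pairs` is our own, not part of the lesson's setup code), we can list every front/back pair that a single break could produce:

```python
def fragment_pairs(peptide):
    """Return (front, back) pieces for every break between adjacent amino acids."""
    return [(peptide[:i], peptide[i:]) for i in range(1, len(peptide))]

pairs = fragment_pairs('VLDALDSIK')
for front, back in pairs:
    print(f'{front:<9} | {back}')
```

A peptide with 9 amino acids has 8 places where it can break, so we get 8 pairs.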

Remember that our experiment has millions of copies of a peptide. By having many peptide copies in our experiment, we get multiple instances of each possible site of fragmentation.
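
Here is a small simulation of that idea. It assumes every break site is equally likely, which is a simplification of the "fairly unbiased" behavior described above, and the function name and copy count are our own choices for illustration.

```python
import random
from collections import Counter

def simulate_fragmentation(peptide, n_copies, seed=0):
    """Break each copy of the peptide once at a random site and count the sites."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    counts = Counter()
    for _ in range(n_copies):
        # site = number of amino acids in the front piece
        # (assumption: every site is equally likely)
        site = rng.randint(1, len(peptide) - 1)
        counts[site] += 1
    return counts

site_counts = simulate_fragmentation('VLDALDSIK', n_copies=100_000)
for site in sorted(site_counts):
    print(f'break after amino acid {site}: {site_counts[site]} copies')
```

With this many copies, every one of the 8 possible break sites shows up many times.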

## **Part 3. The Ion Ladder**

You may think that creating a whole bunch of broken peptide pieces is a mess. But, it's actually a great strategy to help us identify the peptide sequence. Consider the first two front ions created from `VLDALDSIK`. These would be `V` and `VL`, at m/z 100.08 and 213.16, respectively. If we were looking at a spectrum and saw these two peaks, we would notice a nice relationship: they differ by exactly the mass of a leucine residue (113.08). In fact, there is a sequence of increasing or decreasing fragments from each side, with **each fragment being one amino acid different from the previous one**.
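
The arithmetic behind those two numbers can be sketched in a few lines. This assumes singly charged front ions and uses the same rounded proton mass (1.007) as the lesson's `make_ion_ladder` function.

```python
proton_mass = 1.007   # rounded value used in this lesson's ladder function
mass_V = 99.068414    # valine residue mass from aa_mass
mass_L = 113.084064   # leucine residue mass from aa_mass

front_ion_V = mass_V + proton_mass            # front ion 'V'
front_ion_VL = mass_V + mass_L + proton_mass  # front ion 'VL'

print(round(front_ion_V, 2), round(front_ion_VL, 2))
print('difference:', round(front_ion_VL - front_ion_V, 6))
```

The charge contributes the same amount to both ions, so it cancels in the difference and only the leucine mass remains.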

The fragment ions have a special name. We call the front fragment the **b-ion**, and the back part the **y-ion**. We also number the fragment ions. In our example, the first (smallest) b-ion `V` we call *b1*. The next one, `VL`, we call *b2*.

By looking at the difference in m/z between two consecutive fragment ions, we can determine which amino acid was added or removed. By arranging all these fragment ions in sequence, we can decipher the original structure of the peptide. We call this sequence of fragment ions the **ion ladder**.

Let's make the full ion ladder for our peptide `VLDALDSIK`. Run the code cell below to create the ion ladder.

In [None]:
# This function is defined above in Part 1 - take a look at the code up
#   there to see what it is doing
make_ion_ladder('VLDALDSIK', aa_mass)

| | b# | b_ion_m/z | b_ion_sequence | y_ion_sequence | y_ion_m/z | y# |
|---|---|---|---|---|---|---|
| 0 | 1 | 100.075414 | V | LDALDSIK | 874.486898 | 8 |
| 1 | 2 | 213.159478 | VL | DALDSIK | 761.402834 | 7 |
| 2 | 3 | 328.186421 | VLD | ALDSIK | 646.375891 | 6 |
| 3 | 4 | 399.223535 | VLDA | LDSIK | 575.338777 | 5 |
| 4 | 5 | 512.307599 | VLDAL | DSIK | 462.254713 | 4 |
| 5 | 6 | 627.334542 | VLDALD | SIK | 347.22777 | 3 |
| 6 | 7 | 714.36657 | VLDALDS | IK | 260.195742 | 2 |
| 7 | 8 | 827.450634 | VLDALDSI | K | 147.111678 | 1 |
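
To see how the ladder lets us read a sequence, here is a minimal sketch that works backwards from the y-ion m/z values in the table. It looks at the difference between neighboring y-ions and finds the amino acid(s) with a matching mass. The helper names and the 0.01 tolerance are our own assumptions for illustration.

```python
# Residue masses copied from this lesson's aa_mass dictionary
aa_mass = {'A': 71.037114, 'R': 156.101111, 'N': 114.042927,
           'D': 115.026943, 'C': 103.009185, 'E': 129.042593,
           'Q': 128.058578, 'G': 57.021464, 'H': 137.058912,
           'I': 113.084064, 'L': 113.084064, 'K': 128.094963,
           'M': 131.040485, 'F': 147.068414, 'P': 97.052764,
           'S': 87.032028, 'T': 101.047679, 'U': 150.95363,
           'W': 186.079313, 'Y': 163.06332, 'V': 99.068414}

# y-ion m/z values for VLDALDSIK, y1 through y8, from the ion ladder table
y_ladder = [147.111678, 260.195742, 347.22777, 462.254713,
            575.338777, 646.375891, 761.402834, 874.486898]

def residues_matching(delta, tol=0.01):
    """Return every amino acid whose residue mass is within tol of delta."""
    return sorted(aa for aa, mass in aa_mass.items() if abs(mass - delta) <= tol)

def show(candidates):
    """Format one ladder step; ambiguous steps are shown in brackets."""
    if len(candidates) == 1:
        return candidates[0]
    return '[' + '/'.join(candidates) + ']'

# Each step up the y ladder adds one amino acid toward the front of the peptide
steps = [residues_matching(high - low) for low, high in zip(y_ladder, y_ladder[1:])]

# Reverse the steps so the readout runs front-to-back like the peptide sequence
readout = ''.join(show(step) for step in reversed(steps))
print(readout)
```

The readout covers amino acids 2 through 8 of `VLDALDSIK`. Notice the bracketed positions: isoleucine (I) and leucine (L) have identical masses in our dictionary, so mass differences alone cannot tell them apart. The first amino acid (V) and the last one (K) are not covered by differences between y-ions, so they need other information, such as the m/z of y1 itself.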


## **Part 4. The MS2 Spectrum**

So, we started off with an original peptide (called the **precursor**) and blew it to pieces with energy! Because the original peptide was charged (from ionization), the fragments retain a charge. We call these charged fragments **fragment ions**. After fragmentation, the mass spectrometer measures and records the m/z and intensity of the observed fragment ions and calls this set of measurements a **tandem mass spectrum** (or an **MS2**, or **MS/MS**).

Although the same instrument is used to measure an MS1 and an MS2, we must remember their differences. An MS1 is a spectrum showing the m/z of anything being introduced into the mass spectrometer (the *precursors*). An MS2 is a spectrum showing the m/z of *fragment ions*.

Below is an MS2 spectrum for our peptide `VLDALDSIK`.


In [None]:
# These functions are defined above in Part 1 - take a look at the code up
#   there to see what they are doing
ms2_spectrum_unannotated = get_MS2_object(mzml_path, 5672)
plot_MS2(ms2_spectrum_unannotated)

We can use the Python package `spectrum_utils` to annotate our spectrum using our peptide's ion ladder. This means it will color all the observed b-ions blue and all the y-ions red in an MS2 for a given peptide sequence.

Run the code cell below to generate an annotated MS2 spectrum for `VLDALDSIK`.

In [None]:
ms2_spectrum_annotated = get_MS2_object(mzml_path, 5672, peptide = 'VLDALDSIK')
plot_MS2(ms2_spectrum_annotated)

This is a beautiful spectrum! Notice that all possible y-ions are observed (and colored red). Hover your mouse over the y7 peak and note that its measured m/z is 761.41. The m/z we calculated in the table above is 761.40. The difference between the theoretical (calculated) m/z and the measured m/z is about 0.01 m/z, or roughly one one-hundredth of the mass of a proton. That's a very accurate scale!
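
As a quick sketch of that comparison, we can compute the difference ourselves, both in m/z units and in parts per million (ppm), a relative unit often used for mass error. The measured value here is the two-decimal number read from the plot, so the result is only approximate.

```python
theoretical_mz = 761.402834  # y7 from the ion ladder table
measured_mz = 761.41         # read from the plot, rounded to two decimals

error_mz = measured_mz - theoretical_mz
error_ppm = error_mz / theoretical_mz * 1e6

print(f'error: {error_mz:.4f} m/z ({error_ppm:.1f} ppm)')
```

Expressing the error relative to the m/z makes it easier to compare peaks at very different positions in the spectrum.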

While we saw all of the y-ions, we see only two blue b-ions. The reason we don't see all of the b-ions is complicated, but sometimes certain fragmentation processes or conditions in the mass spectrometer can favor the formation of one type of ion over another (more chemistry we're going to skip today). The black peaks are ions that do not match the expected b/y ions for the given peptide. These ions are sometimes called *noise*.

From this spectrum and the annotated ions, we are able to confirm that the sequence that we see in the spectrum is, in fact, `VLDALDSIK`.
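
The idea behind coloring peaks can be sketched as a simple tolerance match: a peak gets a label if it lies close enough to a theoretical ion. This is only an illustration of the concept under our own assumptions; it is not necessarily how `spectrum_utils` works internally. The peak list below is made up for the example and is not taken from the real spectrum. The 0.5 tolerance mirrors the `fragment_tol_mass` used in the setup code.

```python
# Theoretical y-ion m/z values for VLDALDSIK (from the ion ladder table)
theoretical_y = {'y1': 147.111678, 'y2': 260.195742, 'y3': 347.22777,
                 'y4': 462.254713, 'y5': 575.338777, 'y6': 646.375891,
                 'y7': 761.402834, 'y8': 874.486898}

# Made-up peak list, for illustration only (NOT from the real spectrum)
observed_mz = [147.11, 260.20, 301.55, 462.26, 761.41, 790.02]

def annotate_peaks(observed, theoretical, tol=0.5):
    """Label each peak with the closest theoretical ion, or None if too far away."""
    labels = {}
    for mz in observed:
        name, theo = min(theoretical.items(), key=lambda item: abs(item[1] - mz))
        labels[mz] = name if abs(theo - mz) <= tol else None
    return labels

labels = annotate_peaks(observed_mz, theoretical_y)
for mz, label in labels.items():
    print(mz, '->', label if label else 'unmatched (would be drawn in black)')
```

Peaks that fall within the tolerance of a theoretical ion get a label, and everything else stays unannotated.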

## **Part 5. The Instrument Duty Cycle**

Well, we saw a beautiful MS2 spectrum. But, I'm sure you're thinking to yourself that there were many different peptides in our sample. You may also be wondering how this MS2 capability relates to **Lesson 3**'s topic of LC. This leads us to the question, **When is the MS2 spectrum acquired?**

A typical experiment combines all of the techniques we've discussed up to now: LC, MS1 and MS2. This is most often called **LC-MS/MS**, and is a complex dataset that has both MS1 and MS2 data acquired during the LC timeframe. A typical routine goes like this:

1. **MS1 Scan**: The mass spectrometer measures the m/z values of all ions being introduced to the MS. This is a "full scan".

2. **Ion Selection**: Immediately after the MS1, the instrument's software identifies the most prominent m/z peaks and marks them to be further analyzed.

3. **Ion Isolation**: From the short list created in step 2, the MS instrument will isolate a single specific ion. All other ions with a different m/z will be thrown away. There is some amazing physics/engineering that allows the instrument to isolate one m/z and remove all others, but we won't discuss that here. The isolated ion is called the **precursor ion**, as it is the peptide ion prior to fragmentation.

4. **Fragmentation**: Once isolated, the peptide ion is fragmented. Because there are many copies of this peptide which are simultaneously broken, we observe a variety of fragments.

5. **MS2 Scan**: After fragmentation, the instrument measures the fragment ions' m/z and intensity.

6. **Repeat**: After an MS2 scan has been completed in step 5, the mass spectrometer then proceeds with MS2 scans for the remaining ions selected in step 2 by repeating steps 3-5. Once finished with this short list, we return to step 1 and restart the cycle.
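
Step 2 of this routine can be sketched as "pick the most intense peaks from the MS1". The peak list below is made up for illustration, and real instrument software may apply additional criteria that we do not cover in this lesson.

```python
def select_top_n(ms1_peaks, n):
    """Return the m/z values of the n most intense peaks, most intense first."""
    ranked = sorted(ms1_peaks, key=lambda peak: peak[1], reverse=True)
    return [mz for mz, _ in ranked[:n]]

# Made-up (m/z, intensity) pairs, for illustration only
ms1_peaks = [(401.2, 1.0e5), (487.8, 9.0e6), (512.3, 3.5e6),
             (640.9, 2.0e4), (733.4, 6.1e6)]

shortlist = select_top_n(ms1_peaks, n=3)
print(shortlist)
```

Each m/z on this short list would then go through steps 3-5 (isolation, fragmentation, and an MS2 scan) before the cycle starts over.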


Let's think about that loop and what the LC-MS/MS data set would look like. If the number of ions selected in step 2 was always 3 ions, we would get data that looked like this:

**MS1**, MS2, MS2, MS2, **MS1**, MS2, MS2, MS2, **MS1**, MS2, MS2, MS2, **MS1**, ...
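
The same pattern can be generated with a short sketch. The function name and parameters are our own, and the number of MS2 scans per cycle is fixed here only to keep the example simple.

```python
def scan_order(n_cycles, ms2_per_cycle):
    """List the scan types in acquisition order for a fixed number of cycles."""
    order = []
    for _ in range(n_cycles):
        order.append('MS1')                      # step 1: full scan
        order.extend(['MS2'] * ms2_per_cycle)    # steps 3-5, once per selected ion
    return order

order = scan_order(n_cycles=3, ms2_per_cycle=3)
print(', '.join(order))
```

Try changing `ms2_per_cycle` to see how the balance between MS1 and MS2 scans shifts.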

Keep in mind that this is an LC-MS/MS experiment. Thus, as time progresses, peptides will constantly be moving from the LC into the mass spectrometer. Each time we come back to the MS1 part of the cycle, the set of peptides to be measured will be slightly different. A new set of peptides will be selected in step 2, and so it will continue until the experiment has ended.
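
One way to explore this in a data file is to tally spectra by their MS level. The sketch below assumes each spectrum is a dictionary with an `'ms level'` key, which is how `pyteomics` presents mzML spectra. It runs on a tiny stand-in list so that it works without the data file; the commented lines show how it could be pointed at the lesson's file.

```python
from collections import Counter

def count_ms_levels(spectra):
    """Count spectra by MS level; each item is a dict with an 'ms level' key."""
    return Counter(spectrum['ms level'] for spectrum in spectra)

# Tiny stand-in for parsed spectra, for illustration only
toy_spectra = [{'ms level': 1}, {'ms level': 2}, {'ms level': 2},
               {'ms level': 2}, {'ms level': 1}, {'ms level': 2}]

level_counts = count_ms_levels(toy_spectra)
print(level_counts)

# With the lesson's file, the same function could be applied to the reader:
# with mzml.read(mzml_path) as spectra:
#     print(count_ms_levels(spectra))
```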

There are lots of optimizations to the duty cycle that are not discussed here. Our goal is to solidify the basic intuition for an LC-MS/MS experiment and how that gives information on many peptides in the sample through acquiring MS2 data.

## **Conclusion**

Our goal was to identify and measure the peptides within a sample. As we learned in **Lessons 1-3**, measuring the mass of a peptide is not enough to identify its amino acid sequence. In this lesson, we introduced a new technique, tandem mass spectrometry, to break apart a peptide into pieces and then measure these pieces. The fragments of a peptide can be used to confidently identify a peptide (much more on this in [Lesson 5](https://colab.research.google.com/drive/1Weihp1oRIgiXaKwulyeGAcSAuUjb9ihl?usp=sharing)).

We also learned how to combine tandem mass spectrometry with liquid chromatography so that we can get information on many peptides in the sample.

## **Lesson 4 Terms**

* **fragmentation**: the process of breaking a peptide into two pieces; the break comes between two adjacent amino acids
* **precursor / precursor ion**: a peptide ion before fragmentation
* **fragments / fragment ions**: the pieces of the peptide ion after fragmentation
* **b-ion**: the front-half ion created from fragmenting a peptide
* **y-ion**: the back-half ion created from fragmenting a peptide
* **ion ladder**: sequence of b- and y-fragment ions; the difference between a peak in the ladder and the next peak in the ladder is one amino acid
* **LC-MS/MS**: an experiment that combines liquid chromatography, mass spectrometry and tandem mass spectrometry
* **tandem mass spectrometry**: measuring the fragment ions that were created via ion selection, ion isolation and fragmentation