# **Lesson 3 - LC-MS**

In this lesson, we'll learn how to measure all of the peptides in a complex sample. The focus of this lesson, and all lessons, is on data exploration and not the chemistry or engineering of the mass spectrometer.

## **Assumptions**

It is assumed that the reader has completed the prior lessons and is familiar with the basics of biology and introductory chemistry. From the previous lesson, it is critical that you understand the limitations of using m/z alone for identification.

## **Goals**

At the end of this lesson, you should be able to:
- Understand the basics of liquid chromatography and why it is used.
- Become familiar with liquid chromatography as a new dimension of measurement.

## **Context**

In [Lesson 2](https://colab.research.google.com/drive/15cwLXSNBbVSGe1tdFB-VikMSgGXdmkKp#scrollTo=3CWu0IR6LGti), we looked at peptides from the tryptic digestion of a single protein, carbonic anhydrase (CA). We finished Lesson 2 with the conclusion that **there are limitations to identifying a peptide with only m/z,** since many peptides can have the same m/z value. In a complex mixture, we will need some extra information to confidently identify a peptide.

## **Using this Tutorial**

This tutorial is designed to be interactive, and you are encouraged to change the code and explore. To do this, you'll need to save a copy of this so that you have editing permissions. Use `File->Save a copy in Drive` to make an editable copy for yourself. Colab notebooks consist of text cells (like this one) and code cells. You interact with the notebook by executing (running) the code cells by clicking the "play button" in each cell. You can also run all cells at once by using `Runtime->Run all`.

---

## **Part 1. Installation and Setup**

Before diving into the practical aspects of LC-MS, let's prepare our environment by installing the necessary Python packages and defining the necessary functions. These packages will enable us to analyze and visualize the data effectively. To apply the concepts we've learned, we'll be working with real data, so we'll be loading the data files into the Colab environment using `gdown`.

In this notebook, some code cells have been 'hidden' for brevity, like the next few below. You can recognize these because they just have a play button and a small text prompt `Show code`. In addition to the setup code, these include several functions that we will use throughout the lesson - some functions from previous lessons and some plotting code. You may want to look at this later in the lesson, but for now you can probably just click through. The first task is to establish the basic ideas behind **liquid chromatography**.

In [None]:
# @title Run this cell to set up the coding environment, including installing and loading necessary Python packages and loading in the data files.
%%capture
!pip install pyteomics
!pip install gdown
!pip install plotly

import pyteomics
from pyteomics import mzml, auxiliary
import gdown
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import pandas as pd

# MS direct infusion file from lesson 2
!gdown 1E_ipDFM1u6bIKXPYq-ZzZ8xP1Mj_tHyM
di_mzml_path = '/content/CA_DirectInfusion_FullSpectrum_400_1500.mzML'

# LC-MS file
!gdown 1U6EoLTCcqAHwnCv_we4vGCcOBs_yNujt
lc_mzml_path = "/content/04-17-23_CA_Tryp_MS1_10min.mzML"

In [2]:
# @title Run this cell to declare a function that plots a mass spectrum (from Lesson 1).
def plot(spectrum, x_min = None, x_max = None, title = None):
    X = spectrum['m/z array']
    Y = spectrum['intensity array']
    Y_max = max(Y)
    Y_percentage = [(y/Y_max)*100 for y in Y]

    # Fall back to the observed m/z range when no limits are given
    if x_min is None:
        x_min = spectrum.get('lowest observed m/z')
    if x_max is None:
        x_max = spectrum.get('highest observed m/z')

    trace = go.Scatter(
        x = X,
        y = Y_percentage,
        mode = 'lines',
        name = 'Spectrum',
        line=dict(color='black')
    )

    layout = go.Layout(
        title = title,
        xaxis = dict(
            title = 'm/z',
            range = [x_min, x_max],
            linecolor='black',
            mirror=True
        ),
        yaxis = dict(
            title = 'Intensity (%)',
            range = [0, 105],
            linecolor='black',
            mirror=True
        ),
        plot_bgcolor='white',
        paper_bgcolor='white'
    )

    fig = go.Figure(data=[trace], layout=layout)
    fig.show()

In [3]:
# @title Run this cell to declare a function that gets a spectrum object (from Lesson 1).
def get_spectrum_object(mzml_path, scanNum):
    # Open the file, look up the scan by its native ID, and close the file again
    my_id = 'controllerType=0 controllerNumber=1 scan=' + str(scanNum)
    with pyteomics.mzml.MzML(mzml_path) as reader:
        return reader.get_by_id(my_id)

In [4]:
# @title Run this cell to declare a function that gets the retention time for a spectrum.
# `spectrum` should be a spectrum object generated in get_spectrum_object
def get_retention_time(spectrum):
    return spectrum['scanList']['scan'][0]['scan start time']

In [5]:
# @title Run this cell to declare a function that plots multiple MS1 scans as subplots for easier viewing and comparison.

# Because we will be showing multiple spectra, we use a modified plot function
# to make viewing easier
def subplot(x_min=None, x_max=None, title=None, scans=None, num_cols=None, mzml_path=None):
    axis_style = dict(
        linecolor='black',
        showgrid=False,  # This will remove gridlines
        zerolinecolor='gray'
    )

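    # This lets us input either a list of scans or an inclusive (first, last) range of scans
    if isinstance(scans, tuple) and len(scans) == 2:
        scan_nums = list(range(scans[0], scans[1] + 1))
    elif isinstance(scans, list):
        scan_nums = scans
    else:
        raise ValueError("scans must be a list of scan numbers or a (first, last) tuple")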

    # Format rows/columns
    num_scans = len(scan_nums)
    num_rows = -(-num_scans // num_cols)

    max_intensity = 0
    subplot_titles = []
    for scan_num in scan_nums:
        spec = get_spectrum_object(mzml_path, scan_num)
        current_max = max(spec['intensity array'])
        max_intensity = max(max_intensity, current_max)

        subplot_titles.append(f"RT: {get_retention_time(spec):.2f} min")

    fig = make_subplots(rows=num_rows, cols=num_cols, subplot_titles=subplot_titles, vertical_spacing=0.1)

    for idx, scan_num in enumerate(scan_nums):
        spectrum = get_spectrum_object(mzml_path, scan_num)
        X = spectrum['m/z array']
        Y = spectrum['intensity array']
        trace = go.Scatter(x=X, y=Y, mode='lines', name='Spectrum',
                           showlegend=False, line=dict(color='black'))

        row = idx // num_cols + 1
        col = idx % num_cols + 1
        fig.add_trace(trace, row=row, col=col)
        fig.update_xaxes(title_text="m/z", range=[x_min, x_max], **axis_style, row=row, col=col)

        yaxis_title = "Intensity" if col == 1 else None
        show_yaxis_ticks = col == 1
        fig.update_yaxes(title_text=yaxis_title, range=[0, max_intensity], showticklabels=show_yaxis_ticks,
                         tickformat=".1e", **axis_style, row=row, col=col)

    fig.update_layout(plot_bgcolor='white', paper_bgcolor='white')
    fig.update_layout(title_text=title)
    fig.show()

In [6]:
# @title Run this cell to create a 3D plot for multiple MS1 scans.
def plot_3d(x_min=None, x_max=None, title=None, scan_range=None, mzml_path=None):
    traces_3d = []

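    # scan_range is an inclusive (first, last) pair, matching subplot()
    first_scan, last_scan = scan_range
    for scan_num in range(first_scan, last_scan + 1):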
        spectrum = get_spectrum_object(mzml_path, scan_num)
        x_vals = spectrum['m/z array']
        rt = float(get_retention_time(spectrum))
        y_vals = [rt] * len(x_vals)
        z_vals = spectrum['intensity array']

        trace_3d = go.Scatter3d(x=x_vals, y=y_vals, z=z_vals, mode='lines',
                                line=dict(color='black'),
                                name=f'Retention Time: {rt:.2f} min', showlegend=False)
        traces_3d.append(trace_3d)

    plot_title = title if title else 'LC-MS Spectrum'
    layout_3d = go.Layout(
        title=plot_title + " (3D View)",
        scene=dict(
            xaxis_title="m/z",
            xaxis_range=[x_min, x_max],
            yaxis_title="Retention Time (min)",
            zaxis_title="Intensity",
            bgcolor='white',
            xaxis=dict(gridcolor='gray', zerolinecolor='gray'),
            yaxis=dict(gridcolor='gray', zerolinecolor='gray'),
            zaxis=dict(gridcolor='gray', zerolinecolor='gray')
        ),
        plot_bgcolor='white',
        paper_bgcolor='white',
        showlegend=False
    )

    fig_3d = go.Figure(data=traces_3d, layout=layout_3d)
    fig_3d.show()

In [7]:
# @title Run this cell to declare functions that plot an XIC.

# Keep only the most intense matching peak per scan, then sort by time,
# which makes the XIC more readable
def clean_values(df):
    df_slim = df.sort_values('intensity')
    df_slim = df_slim.drop_duplicates(subset=["scan"], keep="last")
    df_slim = df_slim.sort_values('time')
    return df_slim

# Search the MS1 scans inside a retention-time window for peaks whose m/z falls
# within a tolerance of the target m/z, recording each matching peak's scan
# number, scan start time, intensity, and m/z.
def get_MS1_values(target_mz, peak_time, data, window = None):
    df = pd.DataFrame(columns=['scan', 'time', 'intensity', "mz"])
    tol = 0.1
    mz_min = target_mz - tol
    mz_max = target_mz + tol
    if not window:
        window = 0.05  # half-width of the retention-time window, in minutes
    times = data.time[peak_time - window: peak_time + window]

    for spectra in times:
        # checking that we have an MS1 scan
        if spectra['ms level'] == 1:

            # getting the time
            time = (spectra['scanList']['scan'][0].get('scan start time'))

            # get scan number
            scanString = spectra['id']
            startSpot = scanString.find('scan=')
            scanNum = scanString[startSpot + 5:]

            # get intensity and mz
            intensity_array = spectra['intensity array']
            mz_array = spectra["m/z array"]

            # checking through all mz array for anything in our range of mz values
            for x in range(0, len(mz_array)):
                if mz_array[x] > mz_min and mz_array[x] < mz_max:
                    intensity = intensity_array[x]

                    # creating a new row and adding it into the df
                    row = {'scan': scanNum, 'time': time, 'intensity': intensity, 'mz': mz_array[x]}
                    # Concatenating onto an empty DataFrame is deprecated in recent pandas,
                    # so start from the first row instead
                    df = (pd.DataFrame([row]) if df.empty else pd.concat([df, pd.DataFrame([row])], ignore_index=True))
    cleaned_df = clean_values(df)

    return cleaned_df

def make_xic(mz, time, mz_data, time_window = None):
    xic = get_MS1_values(mz, time, mz_data, window = time_window)

    axis_style = dict(
        linecolor='black',
        zerolinecolor='gray'
    )

    fig = go.Figure()
    fig.add_trace(go.Scatter(x=xic['time'], y=xic['intensity'], mode='lines+markers', line=dict(color='black')))

    fig.update_layout(
        title=f'XIC for m/z {mz} across time',
        xaxis_title='Retention Time (min)',
        yaxis_title='Intensity',
        xaxis=axis_style,
        yaxis=axis_style,
        plot_bgcolor='white',
        paper_bgcolor='white'
    )

    fig.show()

## **Part 2. Liquid Chromatography (LC)**

Until now, we've used the **direct infusion** method to measure peptides. This means that we took our protein digest and introduced it *directly* into our mass spectrometer, leading to all the peptides in the sample being measured at once. By adding **liquid chromatography (LC)**, we can sort our peptides based on their chemical properties and only introduce a few at a time into the mass spectrometer. The property we use to sort peptides is **hydrophobicity**. By combining LC and MS we get two measurements of each peptide: hydrophobicity and mass.

Let's start with some intuition about how this works and why it is helpful. Imagine that you have a basket filled with all kinds of fruits that you want to measure. Some different kinds of fruits might have the same weight, like apples and oranges. So if your only tool is to measure weight, you might not be able to distinguish between all the different fruits. Thinking back to the peptides, some would be difficult (or impossible) to distinguish using m/z alone in our direct infusion data. This is where liquid chromatography (LC) can be used to help differentiate between peptides.

Think of LC as a machine that can separate these fruits based on their color in the order of the rainbow. At each step, you choose the next color and measure just what is in that color. For the first step, you gather red fruit - all the strawberries, cranberries, and apples - and you weigh items in this group. In the next step, you gather and weigh orange fruit, then yellow, etc. Each fruit now has both a weight and a step/color associated with it. By having two measurements, we are better able to distinguish between fruits that may weigh the same but are different colors (e.g. apples and oranges).

Instead of sorting fruits by color, LC sorts peptides based on their *hydrophobicity*, that is, how strongly they resist dissolving in water. The more hydrophobic a peptide is, the less likely it is to dissolve in water; instead, it prefers to dissolve in an organic solvent, like acetonitrile.

We'll briefly describe an LC machine and its operation, trying to get across the main points and avoid technical detail. An LC machine has two parts: a thin tube packed with tiny beads, and a pump that pushes liquid through this tube. As liquid emerges from the other end, it is sprayed into the mass spectrometer. At the beginning of the experiment, all peptides are loaded into the LC machine and the tube is filled with a solution that is mostly water. Because peptides are at least somewhat hydrophobic, they cling to the beads at the beginning of the experiment. Throughout an experiment, we gradually change the ratio of water to organic solvent (e.g. acetonitrile) from mostly water to mostly organic solvent. Like the fruit example above, this introduces a time element into our experiment. The less hydrophobic a peptide is, the sooner it leaves the LC machine and goes to the mass spectrometer to be measured. More hydrophobic peptides stay bound until there is more organic solvent, and take longer to leave the LC to be measured.
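To make the timing idea concrete, here is a toy sketch of a linear gradient. Everything in it is an illustrative assumption rather than a description of the instrument used in this lesson: the gradient length, the starting and ending percentages of organic solvent, and the hypothetical "release point" at which each peptide lets go of the beads. Real retention behavior is more complicated than a single threshold.

```python
# Toy model of a linear LC gradient. All numbers are illustrative assumptions.
GRADIENT_MINUTES = 30.0   # hypothetical gradient length
START_ORGANIC = 5.0       # % organic solvent at t = 0 (assumed)
END_ORGANIC = 40.0        # % organic solvent at the end (assumed)


def percent_organic(t_min):
    """Percent organic solvent at time t_min, clamped to the gradient."""
    t = min(max(t_min, 0.0), GRADIENT_MINUTES)
    return START_ORGANIC + (END_ORGANIC - START_ORGANIC) * t / GRADIENT_MINUTES


def toy_retention_time(release_percent):
    """Time at which the gradient first reaches release_percent organic."""
    frac = (release_percent - START_ORGANIC) / (END_ORGANIC - START_ORGANIC)
    frac = min(max(frac, 0.0), 1.0)
    return frac * GRADIENT_MINUTES


# Hypothetical peptides: the % organic at which each one leaves the beads.
release_points = {
    "less hydrophobic": 12.0,
    "intermediate": 22.0,
    "more hydrophobic": 33.0,
}

for name, pct in sorted(release_points.items(), key=lambda kv: kv[1]):
    print(f"{name}: leaves the column at about {toy_retention_time(pct):.1f} min")
```

In this toy model, a peptide that needs more organic solvent always comes out later, which is the ordering described above.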

Some peptides might weigh the same, making it hard to tell them apart just by their m/z. However, using LC, we can separate them based on how hydrophobic they are (or how easily they dissolve). The time between when a peptide is loaded into the LC and when it exits to be measured in the MS is known as its **retention time (RT)**. You can think of it as: the peptide was retained in the LC until this time. A peptide measured at 16.5 minutes into the LC-MS experiment would be said to have a retention time of 16.5 min.

**In short, LC-MS lets you use hydrophobicity *and* mass to distinguish between different peptides in a complex mixture.** For simplicity, in the remainder of the lesson we will talk about these two dimensions as mass and retention time, and not the more complex chemistry of hydrophobicity.

If you want more explanation, a Khan Academy video that explains chromatography can be found [here](https://www.youtube.com/watch?v=SnbXQTTHGs4).

## **Part 3. Direct Infusion vs LC-MS**

Let's review the direct infusion spectrum from **Lesson 2** and then compare it to spectra from an LC-MS experiment of the same mixture. As a reminder, the sample being analyzed is the tryptic digest of protein carbonic anhydrase 2 (CA). There are many peptides resulting from this digested protein.

**Let's take a look at the direct infusion spectrum**:

In [8]:
# get our spectrum object
scan_num = 53
di_mzml = pyteomics.mzml.MzML(di_mzml_path)
my_id = 'controllerType=0 controllerNumber=1 scan='+ str(scan_num)

spectrum = di_mzml.get_by_id(my_id)

In [9]:
# This function is defined above in Part 1 - take a look at the code up
#   there to see what it is doing
plot(spectrum, title="MS1: Direct Infusion Spectrum")

That same sample was also analyzed with LC-MS. **Let's take a look at some spectra that were generated in an LC-MS experiment:**

In [10]:
# This function is defined above in Part 1 - take a look at the code up
#   there to see what it is doing
subplot(scans = [2636, 2702, 2919], num_cols = 3, title = 'Full MS1 Spectra w/ Varying RTs', mzml_path=lc_mzml_path)

When you compare the LC-MS spectra to the direct infusion spectrum, you'll notice:
  - The LC spectra are less cluttered than the direct infusion spectrum.
  - All LC-MS peaks can also be found in the direct infusion spectrum.
  - Some direct infusion peaks do not appear in an individual LC-MS spectrum.

When you compare the LC-MS spectra to each other, you'll notice:
  - Peptides with different retention times don't overlap.

The LC-MS spectra show some of the most abundant peptides in their time window. Even with some noise, they are clearer than the direct infusion spectrum.

With the addition of retention time (RT), each m/z peak now represents fewer peptides compared to the direct infusion data. This makes identification easier.

## **Part 4. Spectra Over Time**

Now, let's investigate the time dimension in more detail. Instead of big jumps in time like we saw in the three panels above, let's look at 10 consecutive measurements. We are also going to focus on a very small m/z range. The intense peak at m/z 1127.59 in the panel for RT 16.74 min is likely peptide `YGDFGTAAQQPDGLAVVGVFLK`. To inspect it, we zoom in on m/z 1124-1134, which captures this peak.

In [11]:
# Let's zoom in around m/z = 1127 and plot the scans with the RT close to where
# we saw the peptide (16.74 min). This will help us better understand what is going on.
subplot(x_min=1124, x_max=1134, scans = (2697,2706), num_cols = 5, mzml_path=lc_mzml_path)

**What do you notice?**

As time (RT) progresses, the intensity of the peptide increases, peaks, then decreases. The total time elapsed in these ten spectra is 0.06 minutes, or about 3-4 seconds. During that time, the peptide became soluble and left the LC instrument to get measured in the mass spectrometer.

If we combine the 10 MS1 spectra above into a 3D plot, we can see this same data in a slightly different way. Below is that plot, where:

- **m/z** is represented on the x-axis,
- **RT** is represented on the y-axis,
- and **Intensity** is represented on the z-axis.


Run the code cell below to produce a 3D plot for the peptide `YGDFGTAAQQPDGLAVVGVFLK`.

In [12]:
# This function is defined above in Part 1 - take a look at the code up
#   there to see what it is doing
plot_3d(x_min=1124, x_max=1134, scan_range=(2697, 2706), mzml_path=lc_mzml_path)

Using your mouse, you can rotate and play around with the graph. You can adjust the view on the 3D model to focus on just Intensity and m/z (spin the plot so that it looks like a 2D plot with Intensity on the vertical axis and m/z on the horizontal axis). What do you notice?

- The isotopic envelope for `YGDFGTAAQQPDGLAVVGVFLK` from the MS1 is clearly visible.
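As a rough check on where this envelope should sit, the sketch below computes theoretical m/z values for `YGDFGTAAQQPDGLAVVGVFLK`. It assumes a 2+ charge state (an assumption, not something stated in the lesson), standard monoisotopic residue masses, and an isotope spacing of about 1.00335 Da divided by the charge. Under those assumptions the monoisotopic peak lands near m/z 1127.08 and the next isotope peak near 1127.58, close to the intense peak at 1127.59.

```python
# Monoisotopic residue masses (Da) for the amino acids in this peptide.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "L": 113.08406, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "F": 147.06841, "Y": 163.06333,
}
WATER = 18.010565          # added once per peptide for the free termini
PROTON = 1.007276          # mass of each charge-carrying proton
ISOTOPE_SPACING = 1.003355  # approximate 13C - 12C mass difference


def monoisotopic_mass(sequence):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER


def mass_to_mz(mass, charge):
    """m/z of a peptide carrying `charge` protons."""
    return (mass + charge * PROTON) / charge


peptide = "YGDFGTAAQQPDGLAVVGVFLK"
charge = 2  # assumed charge state

mono_mz = mass_to_mz(monoisotopic_mass(peptide), charge)
# First few peaks of the isotopic envelope, spaced by ISOTOPE_SPACING / charge
envelope = [mono_mz + k * ISOTOPE_SPACING / charge for k in range(4)]

for k, value in enumerate(envelope):
    print(f"isotope peak {k}: m/z {value:.4f}")
```

The spacing between neighboring peaks in an envelope is a handy clue to charge: peaks about 0.5 apart suggest 2+, and peaks about 1.0 apart suggest 1+.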

Now, adjust the 3D model to focus on Intensity and Retention Time (RT). What do you notice?

- The intensity of the peptide in the spectrum increases and then decreases over time.

## **Part 5. Graphing an XIC**

To focus on just a peptide's behavior over time, we often create a graph plotting Intensity vs. RT. Here we specify just a single m/z and track it over time. This is called an **extracted ion chromatogram (XIC)**.
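The idea behind an XIC can be sketched without any plotting. The snippet below is a minimal illustration, not the notebook's own implementation: it assumes each scan is a plain dictionary with the hypothetical keys `rt`, `mz`, and `intensity` (not the pyteomics spectrum format) and uses made-up numbers. Like the helper functions in Part 1, it keeps the most intense peak within ±0.1 of the target m/z in each scan.

```python
def extract_xic(scans, target_mz, tol=0.1):
    """Return (rt, intensity) pairs, one per scan that has a peak near target_mz.

    For each scan, only the most intense peak within `tol` of `target_mz`
    is kept. Scans with no peak in the window are skipped.
    """
    xic = []
    for scan in scans:
        in_window = [
            intensity
            for mz, intensity in zip(scan["mz"], scan["intensity"])
            if abs(mz - target_mz) < tol
        ]
        if in_window:
            xic.append((scan["rt"], max(in_window)))
    return sorted(xic)  # order the points by retention time


# Made-up scans for illustration only (rt in minutes).
scans = [
    {"rt": 16.72, "mz": [1126.90, 1127.58, 1127.60, 1130.20],
     "intensity": [10.0, 200.0, 150.0, 30.0]},
    {"rt": 16.75, "mz": [1127.59, 1128.09],
     "intensity": [900.0, 700.0]},
    {"rt": 16.78, "mz": [1125.00, 1127.61],
     "intensity": [40.0, 250.0]},
]

xic = extract_xic(scans, target_mz=1127.59)
for rt, intensity in xic:
    print(f"RT {rt:.2f} min -> intensity {intensity:.0f}")
```

Each printed pair is one point on the XIC: a retention time and the intensity observed for the chosen m/z at that time.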

Run the code cell below to plot an XIC for the peptide `YGDFGTAAQQPDGLAVVGVFLK`.

In [13]:
# Choose the mz and retention time (in minutes) that you would like to draw an XIC for
mz = 1127.5916
time = 16.74
lc_mzml = pyteomics.mzml.MzML(lc_mzml_path)

# This function is defined above in Part 1 - take a look at the code up
#   there to see what it is doing
make_xic(mz, time, lc_mzml)

As you can see from the Intensity/RT graph above, the intensity (amount of the peptide being measured) of `YGDFGTAAQQPDGLAVVGVFLK` begins to increase around 16.71 minutes, peaks at about 16.75 minutes, then decreases until 16.78 minutes. This matches what we saw in the LC-MS spectra we plotted at different retention times. Because we have so many MS1 measurements for this precursor over time (each point on the plot represents a measurement), the shape of the XIC is fairly smooth (approximately Gaussian).
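Reading the start, apex, and end of a peak off the plot can also be done programmatically. The sketch below is one simple way to do it, under stated assumptions: the XIC is given as two plain lists (times and intensities), the numbers are made up, and the peak is taken to "begin" and "end" where the intensity drops below 5% of the apex (an arbitrary threshold chosen for illustration).

```python
def describe_peak(times, intensities, fraction=0.05):
    """Return (start_time, apex_time, end_time) for the tallest peak.

    The peak is extended left and right from the apex for as long as the
    intensity stays at or above `fraction` of the apex intensity.
    """
    if not times or len(times) != len(intensities):
        raise ValueError("times and intensities must be non-empty and equal length")

    apex = max(range(len(intensities)), key=lambda i: intensities[i])
    threshold = fraction * intensities[apex]

    start = apex
    while start > 0 and intensities[start - 1] >= threshold:
        start -= 1

    end = apex
    while end < len(times) - 1 and intensities[end + 1] >= threshold:
        end += 1

    return times[start], times[apex], times[end]


# Made-up XIC points for illustration only (time in minutes).
times = [16.69, 16.71, 16.73, 16.75, 16.77, 16.79]
intensities = [0.0, 10.0, 60.0, 100.0, 40.0, 1.0]

start_time, apex_time, end_time = describe_peak(times, intensities)
print(f"peak from {start_time} to {end_time} min, apex at {apex_time} min")
```

A lower `fraction` makes the reported peak wider, so the exact start and end times depend on that choice.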

## **Conclusion**

In this lesson, we learned that LC sorts peptides by hydrophobicity, which helps us tell them apart. Unlike mass, the retention time of a peptide under given LC conditions cannot be calculated simply or exactly from its sequence. So although LC-MS helps us separate multiple peptides with the same m/z, it can't always be used to exactly identify the peptide. For example, if we had two peptides that have the same m/z (`VLDALDSIK` and `VADLLDISK`), we would expect to see two different peaks in an LC-MS experiment. But is `VADLLDISK` the earlier or later peak?
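The claim that `VLDALDSIK` and `VADLLDISK` share an m/z can be checked numerically. The sketch below assumes unmodified peptides and standard monoisotopic residue masses; because the two sequences contain exactly the same amino acids in a different order, their masses come out identical, so mass alone cannot tell them apart.

```python
import math

# Monoisotopic residue masses (Da) for the amino acids in these two peptides.
RESIDUE_MASS = {
    "A": 71.03711, "S": 87.03203, "V": 99.06841, "L": 113.08406,
    "I": 113.08406, "D": 115.02694, "K": 128.09496,
}
WATER = 18.010565  # added once per peptide for the free termini


def monoisotopic_mass(sequence):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER


peptide_a = "VLDALDSIK"
peptide_b = "VADLLDISK"

mass_a = monoisotopic_mass(peptide_a)
mass_b = monoisotopic_mass(peptide_b)

# Same letters in a different order means the same composition and mass.
same_composition = sorted(peptide_a) == sorted(peptide_b)
same_mass = math.isclose(mass_a, mass_b, abs_tol=1e-9)

print(f"same composition: {same_composition}, same mass: {same_mass}")
```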

To identify a peptide, more information is needed. In [Lesson 4](https://colab.research.google.com/drive/13WEV58HpkY7f0kFi2BA5ia5p0XZCL3Cq?usp=sharing), we'll see how **Tandem-MS (MS/MS)** can help solve this problem.

## **Lesson 3 Terms**

* **direct infusion**: an MS experiment where the whole sample is introduced to the mass spectrometer at the same time
* **hydrophobicity**: how strongly a molecule resists dissolving in water; the more hydrophobic a molecule is, the less likely it is that it will dissolve in water
* **liquid chromatography (LC)**: a separation method that sorts peptides based on their *hydrophobicity*; this instrument is often coupled to a mass spectrometer
* **retention time (RT)**: the time at which a peptide leaves the LC instrument, measured from the start of the experiment
* **extracted ion chromatogram (XIC)**: a graph plotting intensity against *retention time (RT)* for a specific m/z value