# Mass Recalibration in CoreMS
## Improving Mass Accuracy Through Calibration

This notebook demonstrates mass recalibration methods in CoreMS for improving mass accuracy in high-resolution mass spectrometry data.

### Overview
Mass recalibration improves the accuracy of mass measurements by:
1. Identifying calibrant ions in your data
2. Calculating mass error trends
3. Applying correction functions to all peaks
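
The three steps above can be sketched on synthetic data. This is a minimal illustration in plain Python, not CoreMS code: the calibrant masses, the simulated error trend, and the helper names (`ppm_error`, `fit_line`, `correct_mz`) are all assumptions made for the example, and a first-order fit is used for brevity where the notebook later uses a second-order polynomial.

```python
# Hypothetical sketch of the recalibration idea; not the CoreMS implementation.

def ppm_error(measured, theoretical):
    """Mass error in parts per million."""
    return (measured - theoretical) / theoretical * 1e6

def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def correct_mz(mz, slope, intercept):
    """Remove the predicted ppm error from a measured m/z value."""
    predicted_ppm = slope * mz + intercept
    return mz / (1 + predicted_ppm * 1e-6)

# Step 1: calibrant ions with known theoretical m/z (made-up values)
theoretical = [200.0, 300.0, 400.0, 500.0, 600.0]
# Simulate a measurement whose error grows linearly from 1.0 to 1.8 ppm
measured = [m * (1 + (0.6 + 0.002 * m) * 1e-6) for m in theoretical]

# Step 2: model the error trend as a function of measured m/z
errors = [ppm_error(obs, ref) for obs, ref in zip(measured, theoretical)]
slope, intercept = fit_line(measured, errors)

# Step 3: apply the correction to every peak
recalibrated = [correct_mz(obs, slope, intercept) for obs in measured]
residuals = [ppm_error(obs, ref) for obs, ref in zip(recalibrated, theoretical)]
print(max(abs(r) for r in residuals))
```

After the correction, the residual errors on the calibrants should be far smaller than the 1.0 to 1.8 ppm simulated in the raw values.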

### Methods Covered
- **[Section 1](#Section-1---Basic-Manual-Mass-Recalibration)**: Manual mass recalibration with visual inspection
- **[Section 2](#Section-2----Segmented-Mass-Recalibration)**: Segmented mass recalibration for non-linear error trends
- **[Section 3](#Section-3---Automatic-Calibration-based-on-Reference-Mass-List)**: Automatic calibration using reference mass lists
- **[Section 4](#Section-4---Automatic-recalibration-based-on-assignments)**: Automatic recalibration based on molecular formula assignments

**Author:** Will Kew, william.kew@pnnl.gov  
**Original:** April 2023 | **Last Updated:** August 2024

### Section 1 - Basic Manual Mass Recalibration

First, let's import the necessary modules.

In [None]:
# Import CoreMS modules for reading Bruker ICR data and setting MS Parameters
# as well as calibration and formula searches
from corems.transient.input.brukerSolarix import ReadBrukerSolarix
from corems.encapsulation.factory.parameters import MSParameters
from corems.mass_spectrum.calc.Calibration import MzDomainCalibration
from corems.molecular_id.search.molecularFormulaSearch import SearchMolecularFormulas



# Import a plotting library for visualisation
import matplotlib.pyplot as plt
from matplotlib import patches
import seaborn as sns

Next, we will define the location of our example data file and set up the parameters for mass spectrum processing and molecular formula searching.

In [None]:
# Provide the file location 
datafile = '../../tests/tests_data/ftms/NEG_ESI_SRFA_Auto.d'
# Initiate the bruker reader object
bruker_reader = ReadBrukerSolarix(datafile)

# Set the noise thresholding method
MSParameters.mass_spectrum.noise_threshold_method = 'log'
MSParameters.mass_spectrum.noise_threshold_log_nsigma = 10

# Set the database connection string:
# Set to empty string if you want to use locally generated sqlite database
MSParameters.molecular_search.url_database = ""
# If you had a docker container with the database, you would set that to:
#         MSParameters.molecular_search.url_database = "postgresql+psycopg2://coremsappdb:coremsapppnnl@localhost:5432/coremsapp"

# Set some formula search settings
MSParameters.molecular_search.min_ppm_error  = -7.5
MSParameters.molecular_search.max_ppm_error = 7.5
MSParameters.molecular_search.usedAtoms['C'] = (1,90)
MSParameters.molecular_search.usedAtoms['H'] = (4,200)
MSParameters.molecular_search.usedAtoms['O'] = (1,23)
MSParameters.molecular_search.isProtonated = True



Now we will load the mass spectrum data and attempt formula assignment on these data before any recalibration.

In [None]:
# Process the spectrum and return the mass spectrum object
mass_spectrum = bruker_reader.get_transient().get_mass_spectrum(plot_result=False,
                                          auto_process=True)
print("There were "+str(len(mass_spectrum))+' peaks detected.')

# Set the search database for molecular formula assignment
mass_spectrum.parameters.molecular_search.url_database = MSParameters.molecular_search.url_database

# Now search for molecular formulas
SearchMolecularFormulas(mass_spectrum, first_hit=True).run_worker_mass_spectrum()

In [None]:
# How many peaks were assigned
mass_spectrum.percentile_assigned()

In [None]:
# Let's visualise the assignments
# First export to a dataframe:
ms_df = mass_spectrum.to_dataframe()

# Now plot the m/z error vs m/z
g = sns.jointplot(x='m/z',y='m/z Error (ppm)',data=ms_df,
              color='k',
             joint_kws={'edgecolor':None,
                       'alpha':0.5})

e2 = patches.Circle((0.5, 0.55), radius=0.1,color='tab:red',
                    linewidth=2, fill=False, zorder=2,
                    transform=g.ax_joint.transAxes)

e3 = patches.Circle((0.55, 0.20), radius=0.1,color='tab:blue',
                    linewidth=2, fill=False, zorder=2,
                    transform=g.ax_joint.transAxes)

g.ax_joint.add_patch(e2)
g.ax_joint.add_patch(e3)


The figure above clearly shows several distinct distributions of errors. Assuming the data are of reasonable quality, only one of them can be correct; the others arise from incorrect assignments.

So, let's plot the Van Krevelen diagrams of the two circled distributions (approximately).

In [None]:
fig,axes = plt.subplots(ncols=2,figsize=(10,3),sharex=True,sharey=True)
# Region 1 (red): errors between 0 and 2 ppm
r1 = ms_df[(ms_df['m/z Error (ppm)']>0)&(ms_df['m/z Error (ppm)']<2)]
axes[0].scatter(x=r1['O/C'],y=r1['H/C'],c='tab:red',alpha=0.5)
# Region 2 (blue): errors between -6 and -2 ppm
r2 = ms_df[(ms_df['m/z Error (ppm)']<-2)&(ms_df['m/z Error (ppm)']>-6)]
axes[1].scatter(x=r2['O/C'],y=r2['H/C'],c='tab:blue',alpha=0.5)

for ax in axes:
    ax.set_xlabel('O/C')
axes[0].set_ylabel('H/C')
axes[0].set_title('Region 1')
axes[1].set_title('Region 2')
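
For reference, a Van Krevelen diagram places each assigned formula at its oxygen-to-carbon ratio (x) and its hydrogen-to-carbon ratio (y); these are the `O/C` and `H/C` columns used above. A minimal sketch of that calculation follows, assuming formulas are held as plain element-count dictionaries. The `van_krevelen_coords` helper and the example formulas are hypothetical and not part of CoreMS.

```python
def van_krevelen_coords(formula):
    """Return (O/C, H/C) for a dict of element counts."""
    carbon = formula.get("C", 0)
    if carbon == 0:
        raise ValueError("Van Krevelen ratios are undefined without carbon")
    return formula.get("O", 0) / carbon, formula.get("H", 0) / carbon

# Example formulas chosen for illustration only
formulas = [
    {"C": 6, "H": 12, "O": 6},    # glucose, C6H12O6
    {"C": 16, "H": 32, "O": 2},   # palmitic acid, C16H32O2
]
coords = [van_krevelen_coords(f) for f in formulas]
print(coords)
```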

Clearly, region 1 is the 'correct' region, so we can recalibrate the data within those error constraints (0 to 2 ppm).

In [None]:
# Define the location of our reference mass list
refmasslist = '../../tests/tests_data/Hawkes_neg.ref'


# Define the mass calibration settings:
mass_spectrum.settings.calib_sn_threshold  = 20
mass_spectrum.settings.max_calib_ppm_error = 2
mass_spectrum.settings.min_calib_ppm_error = 0
mass_spectrum.settings.calib_pol_order = 2

MzDomainCalibration(mass_spectrum,refmasslist).run()
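
One way to read the settings above is as a filter on candidate calibrant peaks: a peak is kept only if its signal-to-noise ratio is high enough and its error against a reference mass falls inside the chosen ppm window. The sketch below illustrates that reading in plain Python. The `select_calibrants` helper, the tuple layout, and the peak values are assumptions for the example and do not reproduce how `MzDomainCalibration` is implemented.

```python
def select_calibrants(peaks, sn_threshold, min_ppm, max_ppm):
    """Keep (measured_mz, reference_mz, s2n) tuples that pass both filters."""
    selected = []
    for measured_mz, reference_mz, s2n in peaks:
        error_ppm = (measured_mz - reference_mz) / reference_mz * 1e6
        if s2n >= sn_threshold and min_ppm <= error_ppm <= max_ppm:
            selected.append((measured_mz, reference_mz, error_ppm))
    return selected

# Made-up candidate matches: (measured m/z, reference m/z, signal-to-noise)
candidates = [
    (301.0003, 301.0000, 150.0),   # about +1.0 ppm, strong signal: kept
    (401.0004, 401.0000, 8.0),     # about +1.0 ppm, but below the S/N threshold
    (500.9980, 501.0000, 90.0),    # about -4.0 ppm, outside the 0 to 2 ppm window
]
calibrants = select_calibrants(candidates, sn_threshold=20, min_ppm=0, max_ppm=2)
print(len(calibrants))
```

Only the surviving peaks would then feed the polynomial fit, whose order is set by `calib_pol_order`.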

In [None]:
# Clear previous formula assignments
mass_spectrum.clear_molecular_formulas()

# Tighten the error threshold for assignments now that the data are recalibrated
mass_spectrum.molecular_search_settings.min_ppm_error  = -0.75
mass_spectrum.molecular_search_settings.max_ppm_error = 0.75

# Redo the CHO search
SearchMolecularFormulas(mass_spectrum, first_hit=True).run_worker_mass_spectrum()

# Report how many peaks were assigned
mass_spectrum.percentile_assigned()

In [None]:
# Let's visualise the assignments
# First export to a dataframe:
ms_df = mass_spectrum.to_dataframe()

# Now plot the m/z error vs m/z
g = sns.jointplot(x='m/z',y='m/z Error (ppm)',data=ms_df,
              color='k',height=4,
             joint_kws={'edgecolor':None,
                       'alpha':0.5})

f = sns.jointplot(x='O/C',y='H/C',data=ms_df,
                 color='k',height=4,
             joint_kws={'edgecolor':None,
                       'alpha':0.5})

### Section 2 -- Segmented Mass Recalibration

In [None]:
# The output above looks good in Van Krevelen space,
# but the errors are still somewhat dispersed.

# We may be able to improve on this with the 'segmented' mass calibration.

# First, let's reload the mass spectrum object:
# process the spectrum and return the mass spectrum object
mass_spectrum = bruker_reader.get_transient().get_mass_spectrum(plot_result=False,
                                          auto_process=True)



In [None]:
# Now let's do the segmented recalibration
# Define the location of our reference mass list
refmasslist = '../../tests/tests_data/Hawkes_neg.ref'

# Define the mass calibration settings:
mass_spectrum.settings.calib_sn_threshold  = 20
mass_spectrum.settings.max_calib_ppm_error = 2
mass_spectrum.settings.min_calib_ppm_error = 0
mass_spectrum.settings.calib_pol_order = 2
mass_spectrum.parameters.molecular_search.url_database = MSParameters.molecular_search.url_database

MzDomainCalibration(mass_spectrum,refmasslist,mzsegment=[0,375]).run()

MzDomainCalibration(mass_spectrum,refmasslist,mzsegment=[375,1000]).run()
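
Conceptually, a segmented recalibration fits a separate correction to each m/z range, which helps when a single polynomial cannot follow the error trend across the whole spectrum. A minimal sketch on synthetic data follows. The breakpoint at m/z 375 mirrors the calls above, but the `fit_line` and `recalibrate_segment` helpers, the simulated errors, and the first-order fit are assumptions for illustration, not the CoreMS implementation.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def recalibrate_segment(measured, reference, lo, hi):
    """Fit and remove a linear ppm-error trend for peaks with lo <= m/z < hi."""
    idx = [i for i, mz in enumerate(measured) if lo <= mz < hi]
    xs = [measured[i] for i in idx]
    errs = [(measured[i] - reference[i]) / reference[i] * 1e6 for i in idx]
    slope, intercept = fit_line(xs, errs)
    return {i: measured[i] / (1 + (slope * measured[i] + intercept) * 1e-6)
            for i in idx}

def simulated_error_ppm(mz):
    """Flat 1 ppm error below m/z 375, rising linearly above it."""
    return 1.0 if mz < 375 else 1.0 + 0.004 * (mz - 375)

# Made-up calibrant masses, three in each segment
reference = [150.0, 250.0, 350.0, 450.0, 550.0, 650.0]
measured = [m * (1 + simulated_error_ppm(m) * 1e-6) for m in reference]

# Correct each segment independently, then collect the results by peak index
corrected = {}
for lo, hi in [(0, 375), (375, 1000)]:
    corrected.update(recalibrate_segment(measured, reference, lo, hi))

residuals = [(corrected[i] - ref) / ref * 1e6 for i, ref in enumerate(reference)]
print(max(abs(r) for r in residuals))
```

Because the simulated trend has a kink at the breakpoint, one straight line over the full range would leave a systematic residual, whereas the two per-segment fits remove it almost entirely.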



In [None]:
# Update threshold for assignment
mass_spectrum.molecular_search_settings.min_ppm_error  = -0.75
mass_spectrum.molecular_search_settings.max_ppm_error = 0.75

# Redo the CHO search
SearchMolecularFormulas(mass_spectrum, first_hit=True).run_worker_mass_spectrum()

# Report how many peaks were assigned
mass_spectrum.percentile_assigned()

# Let's visualise the assignments
# First export to a dataframe:
ms_df = mass_spectrum.to_dataframe()

# Now plot the m/z error vs m/z
g = sns.jointplot(x='m/z',y='m/z Error (ppm)',data=ms_df,
              color='k',height=4,
             joint_kws={'edgecolor':None,
                       'alpha':0.5})

f = sns.jointplot(x='O/C',y='H/C',data=ms_df,
                 color='k',height=4,
             joint_kws={'edgecolor':None,
                       'alpha':0.5})

### Section 3 - Automatic Calibration based on Reference Mass List

In [None]:
# First, let's reload the mass spectrum object:
# process the spectrum and return the mass spectrum object
mass_spectrum = bruker_reader.get_transient().get_mass_spectrum(plot_result=False,
                                          auto_process=True)
mass_spectrum.parameters.molecular_search.url_database = MSParameters.molecular_search.url_database

In [None]:
# Now select the newer 'merged' method for matching reference masses.
# Note: the original method is still available by setting this to 'legacy'.
MSParameters.mass_spectrum.calibration_ref_match_method = 'merged'
MSParameters.mass_spectrum.calibration_ref_match_tolerance = 0.003
MzDomainCalibration(mass_spectrum,refmasslist).run()
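
Any reference-list calibration needs a step that pairs measured peaks with reference masses. The sketch below shows one generic way to do that, a nearest-neighbour match within a tolerance, in plain Python. It is not the CoreMS 'merged' or 'legacy' algorithm: the `match_to_reference` helper, the example masses, and the treatment of the tolerance as an absolute m/z window are all assumptions made for illustration.

```python
from bisect import bisect_left

def match_to_reference(peak_mzs, reference_mzs, tolerance):
    """Pair each peak with its nearest reference mass if within tolerance."""
    refs = sorted(reference_mzs)
    matches = []
    for mz in peak_mzs:
        pos = bisect_left(refs, mz)
        # The nearest reference is either just below or just above the peak
        neighbours = refs[max(pos - 1, 0):pos + 1]
        nearest = min(neighbours, key=lambda ref: abs(ref - mz))
        if abs(nearest - mz) <= tolerance:
            matches.append((mz, nearest))
    return matches

# Made-up values for illustration
reference_mzs = [255.2330, 311.1689, 383.1106]
peak_mzs = [255.2334, 300.0000, 383.1101]
pairs = match_to_reference(peak_mzs, reference_mzs, tolerance=0.003)
print(pairs)
```

A tight tolerance keeps unrelated peaks (such as the one at m/z 300 here) out of the calibration set, at the cost of missing calibrants when the raw error is large.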

In [None]:
# And now assign the recalibrated data
# Update threshold for assignment
mass_spectrum.molecular_search_settings.min_ppm_error  = -0.75
mass_spectrum.molecular_search_settings.max_ppm_error = 0.75

# Redo the CHO search
SearchMolecularFormulas(mass_spectrum, first_hit=True).run_worker_mass_spectrum()

# Report how many peaks were assigned
mass_spectrum.percentile_assigned()

# Let's visualise the assignments
# First export to a dataframe:
ms_df = mass_spectrum.to_dataframe()

# Now plot the m/z error vs m/z
g = sns.jointplot(x='m/z',y='m/z Error (ppm)',data=ms_df,
              color='k',height=4,
             joint_kws={'edgecolor':None,
                       'alpha':0.5})

f = sns.jointplot(x='O/C',y='H/C',data=ms_df,
                 color='k',height=4,
             joint_kws={'edgecolor':None,
                       'alpha':0.5})

### Section 4 - Automatic recalibration based on assignments

In [None]:
# Again, reload the dataset:
# process the spectrum and return the mass spectrum object
mass_spectrum = bruker_reader.get_transient().get_mass_spectrum(plot_result=False,
                                          auto_process=True)
mass_spectrum.parameters.molecular_search.url_database = MSParameters.molecular_search.url_database
mass_spectrum.parameters.molecular_search.db_jobs = 1

In [None]:
# Import the class for automatic recalibration based on assignments
from corems.mass_spectrum.calc.AutoRecalibration import HighResRecalibration


In [None]:
# This class is initialised with a few options; the mass spectrum object must be passed first.
# plot=True - plot the models used for the automatic determination of the error boundaries.
# docker=True - use Docker for formula assignment (otherwise, sqlite is used).
# ppmFWHMprior - estimate of the spread of mass errors in the raw data.
# ppmRangeprior - estimate of the possible range of mass error medians (e.g. 15 = +-7.5).
autorecaler = HighResRecalibration(mass_spectrum, plot = True, docker = True, ppmFWHMprior = 3, ppmRangeprior = 15)

In [None]:
auto_cal_boundaries = autorecaler.determine_error_boundaries()

In [None]:
print(
    f"Raw error center: {auto_cal_boundaries[0]:.2f} ppm,\n"
    f"Raw error FWHM: {auto_cal_boundaries[1]:.2f} ppm,\n"
    f"Suggested bounds based on mean error +- FWHM: "
    f"{auto_cal_boundaries[2][0]:.2f} to {auto_cal_boundaries[2][1]:.2f} ppm"
)
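
To make the three printed values concrete: for a roughly Gaussian error distribution, the FWHM equals 2*sqrt(2*ln 2), about 2.355, times the standard deviation, and bounds can be drawn at the centre plus or minus one FWHM. The sketch below computes these for made-up errors using only the standard library. `HighResRecalibration` fits its own models to determine these quantities, so the `estimate_error_bounds` helper here is a hypothetical simplification, not its algorithm.

```python
import math
import statistics

def estimate_error_bounds(errors_ppm):
    """Return (centre, fwhm, (lower, upper)) for a list of ppm errors."""
    centre = statistics.mean(errors_ppm)
    sigma = statistics.stdev(errors_ppm)
    fwhm = 2 * math.sqrt(2 * math.log(2)) * sigma  # Gaussian FWHM from sigma
    return centre, fwhm, (centre - fwhm, centre + fwhm)

# Made-up raw assignment errors in ppm
errors_ppm = [0.8, 0.9, 1.0, 1.0, 1.1, 1.2]
centre, fwhm, bounds = estimate_error_bounds(errors_ppm)
print(f"centre {centre:.2f} ppm, FWHM {fwhm:.2f} ppm, "
      f"bounds {bounds[0]:.2f} to {bounds[1]:.2f} ppm")
```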

In [None]:
# Now recalibrate the data based on those bounds:
# Define the mass calibration settings:
# Use the original/legacy method for matching reference masses. 
MSParameters.mass_spectrum.calibration_ref_match_method = 'legacy'
mass_spectrum.settings.calib_sn_threshold  = 20
mass_spectrum.settings.max_calib_ppm_error = max(auto_cal_boundaries[2])
mass_spectrum.settings.min_calib_ppm_error = min(auto_cal_boundaries[2])
mass_spectrum.settings.calib_pol_order = 2

MzDomainCalibration(mass_spectrum,refmasslist).run()

In [None]:
# And now assign the recalibrated data
# Update threshold for assignment
mass_spectrum.molecular_search_settings.min_ppm_error  = -0.75
mass_spectrum.molecular_search_settings.max_ppm_error = 0.75
mass_spectrum.molecular_search_settings.usedAtoms['C'] = (1,90)
mass_spectrum.molecular_search_settings.usedAtoms['H'] = (4,200)
mass_spectrum.molecular_search_settings.usedAtoms['O'] = (1,23)
mass_spectrum.molecular_search_settings.isProtonated = True

# Redo the CHO search
SearchMolecularFormulas(mass_spectrum, first_hit=True).run_worker_mass_spectrum()

# Report how many peaks were assigned
mass_spectrum.percentile_assigned()

# Let's visualise the assignments
# First export to a dataframe:
ms_df = mass_spectrum.to_dataframe()

# Now plot the m/z error vs m/z
g = sns.jointplot(x='m/z',y='m/z Error (ppm)',data=ms_df,
              color='k',height=4,
             joint_kws={'edgecolor':None,
                       'alpha':0.5})

f = sns.jointplot(x='O/C',y='H/C',data=ms_df,
                 color='k',height=4,
             joint_kws={'edgecolor':None,
                       'alpha':0.5})