
This notebook contains everything related to the recalibration of data.

## Recalibration after search

### Precursor mass calibration

Recalibration refers to the computational step in which masses are recalibrated after a first search. The identified peptides are used to calculate the deviations of the experimental masses from their theoretical masses. After recalibration, a second search with a narrower precursor tolerance is performed.

The recalibration is largely motivated by the software lock mass paper:

[Cox J, Michalski A, Mann M. Software lock mass by two-dimensional minimization of peptide mass errors. J Am Soc Mass Spectrom. 2011;22(8):1373-1380. doi:10.1007/s13361-011-0142-8](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3231580/)

In that paper, mass offsets are approximated piecewise linearly. The positions for the approximation need to fulfill a number of criteria (e.g., a minimum number of samples and a minimum distance). The AlphaPept implementation is slightly modified by employing a more general `KNeighborsRegressor` approach. In brief, the calibration is calculated for each point individually by estimating the deviation from its identified neighbors in n-dimensional space (e.g., retention time, mass, mobility).

More specifically, the algorithm consists of the following steps:

1. Outlier removal: We remove outliers from the identified peptides by only accepting identifications whose mass offset is within n (default 3) standard deviations of the median.
2. Neighbors lookup: For each point, we look up its n (default 100) nearest neighbors. For this lookup, the axes need to be scaled, which is done with a transform function that is either absolute or relative.
3. Regression: Next, we perform a regression based on the neighbors to determine the mass offset. The contribution of each neighbor is weighted by its distance.
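
Steps 2 and 3 above can be sketched with scikit-learn as follows. This is a minimal illustration on synthetic data, not the AlphaPept implementation: the variable names, the synthetic drift, the log-based relative scaling of m/z, and the sign convention of the final correction are all assumptions made for the example.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)

# Hypothetical identified peptides: retention time (min), m/z, observed offset (ppm).
n = 500
rt = rng.uniform(0, 60, n)
mz = rng.uniform(400, 1200, n)
true_offset = 1.0 + 0.01 * (mz - 400)          # synthetic drift along m/z
offset_ppm = true_offset + rng.normal(0, 0.5, n)

# Scale each axis so that distances become comparable (assumed ranges).
rt_range = 0.5         # absolute scaling in minutes
mz_range = 100 / 1e6   # relative scaling, applied to log(m/z)
X = np.column_stack([rt / rt_range, np.log(mz) / mz_range])

# Regression on the 100 nearest neighbors, each weighted by inverse distance.
model = KNeighborsRegressor(n_neighbors=100, weights="distance")
model.fit(X, offset_ppm)

# Estimate the offset for two hypothetical features and correct their m/z.
feat_rt = np.array([10.0, 50.0])
feat_mz = np.array([500.0, 1100.0])
X_feat = np.column_stack([feat_rt / rt_range, np.log(feat_mz) / mz_range])
predicted_offset = model.predict(X_feat)

# Assumed convention: offset = (observed - theoretical) / theoretical * 1e6.
corrected_mz = feat_mz * (1 - predicted_offset / 1e6)
```

Because the prediction is a weighted average of neighboring offsets, it always stays within the range of the offsets seen during fitting.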

### Fragment mass calibration

The fragment mass calibration is based on the identified fragment_ions (i.e., b-hits and y-hits). For each hit, we calculate the offset to its theoretical mass. The correction is then applied by taking the median offset in ppm and applying it globally.
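
The global correction can be illustrated with a few hand-made fragment hits. The sketch below uses only the standard library; the helper `ppm_offset`, the example masses, and the sign convention are assumptions for illustration and are not taken from the AlphaPept code.

```python
from statistics import median


def ppm_offset(observed: float, theoretical: float) -> float:
    """Offset of an observed mass relative to its theoretical mass, in ppm."""
    return (observed - theoretical) / theoretical * 1e6


# Hypothetical fragment hits as (observed m/z, theoretical m/z) pairs.
hits = [
    (500.0025, 500.0),
    (750.0030, 750.0),
    (1000.0050, 1000.0),
    (300.0018, 300.0),
]

# One offset per hit, then a single median that is applied to all fragments.
offsets = [ppm_offset(obs, theo) for obs, theo in hits]
median_ppm = median(offsets)
corrected = [obs * (1 - median_ppm / 1e6) for obs, _ in hits]
```

Using the median rather than the mean keeps a few badly matched fragments from shifting the global correction.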

---

[source](https://github.com/mannlabs/alphapept/blob/master/alphapept/recalibration.py#L12){target="_blank" style="float:right; font-size:smaller"}

### remove_outliers

>      remove_outliers (df:pandas.core.frame.DataFrame, outlier_std:float)

Helper function to remove outliers from a dataframe.
Outliers are removed based on the precursor mass offset (prec_offset_ppm).
All values within x standard deviations of the median are kept.

Args:
    df (pd.DataFrame): Input dataframe that contains a prec_offset_ppm-column.
    outlier_std (float): Range of standard deviations to filter outliers

Raises:
    ValueError: An error if the column is not present in the dataframe.

Returns:
    pd.DataFrame: A dataframe without outliers.
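
A filter matching this description could look like the following sketch. The function name `remove_outliers_sketch` and the example values are hypothetical; only the column name `prec_offset_ppm`, the median-based rule, and the `ValueError` come from the documentation above.

```python
import pandas as pd


def remove_outliers_sketch(df: pd.DataFrame, outlier_std: float) -> pd.DataFrame:
    """Keep rows whose offset lies within outlier_std standard deviations of the median."""
    if "prec_offset_ppm" not in df.columns:
        raise ValueError("Column prec_offset_ppm not in df")
    offsets = df["prec_offset_ppm"]
    center = offsets.median()
    spread = offsets.std()
    keep = (offsets - center).abs() < outlier_std * spread
    return df[keep].copy()


# Ten offsets close to 1 ppm and a single gross outlier at 50 ppm.
df = pd.DataFrame(
    {"prec_offset_ppm": [1.0, 1.2, 0.8, 1.1, 0.9, 1.0, 1.05, 0.95, 1.15, 0.85, 50.0]}
)
filtered = remove_outliers_sketch(df, outlier_std=3)
```

Note that the standard deviation is itself inflated by the outlier, so with very few rows an extreme value can survive the filter.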

---

[source](https://github.com/mannlabs/alphapept/blob/master/alphapept/recalibration.py#L42){target="_blank" style="float:right; font-size:smaller"}

### transform

>      transform (x:numpy.ndarray, column:str, scaling_dict:dict)

Helper function to transform an input array for the neighbors lookup used for calibration.

Note: The scaling_dict stores information about how scaling is applied and is defined in get_calibration.

Relative transformation: Distances are compared relatively; for mz this is in ppm, for mobility in %.
Absolute transformation: Distances are compared absolutely; for RT this is the time difference.

An example definition is below:

scaling_dict = {}
scaling_dict['mz'] = ('relative', calib_mz_range/1e6)
scaling_dict['rt'] = ('absolute', calib_rt_range)
scaling_dict['mobility'] = ('relative', calib_mob_range)

Args:
    x (np.ndarray): Input array.
    column (str): String to lookup what scaling should be applied.
    scaling_dict (dict): Lookup dict to retrieve the scaling operation and factor for the column.

Raises:
    KeyError: An error if the column is not present in the dict.
    NotImplementedError: An error if the scaling type is neither 'relative' nor 'absolute'.

Returns:
    np.ndarray: A scaled array.
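
The difference between the two scaling types can be made concrete with a small sketch. It assumes that the relative transformation is realized with a logarithm, so that equal ratios map to equal distances; this is one plausible choice and may differ from the actual implementation. The name `transform_sketch` and the example values are hypothetical.

```python
import numpy as np


def transform_sketch(x: np.ndarray, column: str, scaling_dict: dict) -> np.ndarray:
    """Scale an array according to the (type, factor) entry for its column."""
    if column not in scaling_dict:
        raise KeyError(f"Column {column} not in scaling_dict")
    kind, scale = scaling_dict[column]
    if kind == "relative":
        # log(a) - log(b) is approximately (a - b) / b for small differences
        return np.log(x) / scale
    if kind == "absolute":
        return x / scale
    raise NotImplementedError(f"Scaling type {kind} not known")


scaling_dict = {"mz": ("relative", 100 / 1e6), "rt": ("absolute", 0.5)}

mz = np.array([500.0, 500.05, 1000.0, 1000.1])  # each pair is 100 ppm apart
rt = np.array([10.0, 10.5])                     # 0.5 time units apart

mz_scaled = transform_sketch(mz, "mz", scaling_dict)
rt_scaled = transform_sketch(rt, "rt", scaling_dict)
```

After scaling, a 100 ppm difference in m/z and a 0.5 difference in retention time both correspond to a distance of about one unit, regardless of the absolute m/z value.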

---

[source](https://github.com/mannlabs/alphapept/blob/master/alphapept/recalibration.py#L88){target="_blank" style="float:right; font-size:smaller"}

### kneighbors_calibration

>      kneighbors_calibration (df:pandas.core.frame.DataFrame,
>                              features:pandas.core.frame.DataFrame, cols:list,
>                              target:str, scaling_dict:dict,
>                              calib_n_neighbors:int)

Calibration using a KNeighborsRegressor.
Input arrays are transformed to be used with a nearest-neighbor approach.
Based on neighboring points a calibration is calculated for each input point.

Args:
    df (pd.DataFrame): Input dataframe that contains identified peptides (without outliers).
    features (pd.DataFrame): Features dataframe for which the masses are calibrated.
    cols (list): List of input columns for the calibration.
    target (str): Target column on which offset is calculated.
    scaling_dict (dict): A dictionary that contains how scaling operations are applied.
    calib_n_neighbors (int): Number of neighbors for calibration.

Returns:
    np.ndarray: A numpy array with calibrated masses.

---

[source](https://github.com/mannlabs/alphapept/blob/master/alphapept/recalibration.py#L128){target="_blank" style="float:right; font-size:smaller"}

### get_calibration

>      get_calibration (df:pandas.core.frame.DataFrame,
>                       features:pandas.core.frame.DataFrame, file_name='',
>                       settings=None, outlier_std:float=3,
>                       calib_n_neighbors:int=100, calib_mz_range:int=100,
>                       calib_rt_range:float=0.5, calib_mob_range:float=0.3,
>                       **kwargs)

Wrapper function to get calibrated values for the precursor mass.

Args:
    df (pd.DataFrame): Input dataframe that contains identified peptides.
    features (pd.DataFrame): Features dataframe for which the masses are calibrated.
    outlier_std (float, optional): Range in standard deviations for outlier removal. Defaults to 3.
    calib_n_neighbors (int, optional): Number of neighbors used for regression. Defaults to 100. 
    calib_mz_range (int, optional): Scaling factor for mz range. Defaults to 100.
    calib_rt_range (float, optional): Scaling factor for rt_range. Defaults to 0.5.
    calib_mob_range (float, optional): Scaling factor for mobility range. Defaults to 0.3.
    **kwargs: Arbitrary keyword arguments so that settings can be passed as a whole.

Returns:
    corrected_mass (np.ndarray): The calibrated mass.
    y_hat_std (float): The standard deviation of the precursor offset after calibration.

---

[source](https://github.com/mannlabs/alphapept/blob/master/alphapept/recalibration.py#L347){target="_blank" style="float:right; font-size:smaller"}

### calibrate_fragments_nn

>      calibrate_fragments_nn (ms_file_, file_name, settings)

---

[source](https://github.com/mannlabs/alphapept/blob/master/alphapept/recalibration.py#L293){target="_blank" style="float:right; font-size:smaller"}

### save_precursor_calibration

>      save_precursor_calibration (df, corrected, std_offset, file_name,
>                                  settings)

---

[source](https://github.com/mannlabs/alphapept/blob/master/alphapept/recalibration.py#L240){target="_blank" style="float:right; font-size:smaller"}

### save_fragment_calibration

>      save_fragment_calibration (fragment_ions, corrected, std_offset,
>                                 file_name, settings)

---

[source](https://github.com/mannlabs/alphapept/blob/master/alphapept/recalibration.py#L219){target="_blank" style="float:right; font-size:smaller"}

### density_scatter

>      density_scatter (x, y, ax=None, sort=True, bins=20, **kwargs)

Scatter plot colored by a 2D histogram.
Adapted from https://stackoverflow.com/questions/20105364/how-can-i-make-a-scatter-plot-colored-by-density-in-matplotlib

---

[source](https://github.com/mannlabs/alphapept/blob/master/alphapept/recalibration.py#L214){target="_blank" style="float:right; font-size:smaller"}

### chunks

>      chunks (lst, n)

Yield successive n-sized chunks from lst.
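
A generator with this behavior can be written in a few lines. The sketch below is a common idiom and is given under the hypothetical name `chunks_sketch`; it is not copied from the AlphaPept source.

```python
def chunks_sketch(lst, n):
    """Yield successive n-sized chunks from lst; the last chunk may be shorter."""
    for i in range(0, len(lst), n):
        yield lst[i:i + n]


batches = list(chunks_sketch(list(range(7)), 3))
print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```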

---

[source](https://github.com/mannlabs/alphapept/blob/master/alphapept/recalibration.py#L462){target="_blank" style="float:right; font-size:smaller"}

### calibrate_hdf

>      calibrate_hdf (to_process:tuple, callback=None, parallel=True)

Wrapper function to calibrate an HDF file when using the parallel executor.
The function loads the respective dataframes from the hdf, calls the calibration function and applies the offset.

Args:
    to_process (tuple): Tuple that contains the file index and the settings dictionary.
    callback ([type], optional): Placeholder for callback (unused).
    parallel (bool, optional): Placeholder for parallel usage (unused).

Returns:
    Union[str,bool]: Either True as boolean when calibration is successful or the error message as string.

### Database calibration

Another way to calibrate the fragment and precursor masses is to compare them directly to a previously generated theoretical mass database. Here, peaks in the mass distribution of the database are used to align the experimental masses.

---

[source](https://github.com/mannlabs/alphapept/blob/master/alphapept/recalibration.py#L551){target="_blank" style="float:right; font-size:smaller"}

### get_db_targets

>      get_db_targets (db_file_name:str, max_ppm:int=100,
>                      min_distance:float=0.5, ms_level:int=2)

Function to extract database targets for database-calibration.
Based on the FASTA database it finds masses that occur often. These will be used for calibration.

Args:
    db_file_name (str): Path to the database.
    max_ppm (int, optional): Maximum distance in ppm between two peaks. Defaults to 100.
    min_distance (float, optional): Minimum distance between two calibration peaks. Defaults to 0.5.
    ms_level (int, optional): MS level used for calibration, either precursors (1) or fragment masses (2). Defaults to 2.

Raises:
    ValueError: When ms_level is not valid.

Returns:
    np.ndarray: Numpy array with calibration masses.

---

[source](https://github.com/mannlabs/alphapept/blob/master/alphapept/recalibration.py#L610){target="_blank" style="float:right; font-size:smaller"}

### align_run_to_db

>      align_run_to_db (ms_data_file_name:str, db_array:numpy.ndarray,
>                       max_ppm_distance:int=1000000, rt_step_size:float=0.1,
>                       plot_ppms:bool=False, ms_level:int=2)

Function to align a run to its theoretical FASTA database.

Args:
    ms_data_file_name (str): Path to the run.
    db_array (np.ndarray): Numpy array containing the database targets.
    max_ppm_distance (int, optional): Maximum distance in ppm. Defaults to 1000000.
    rt_step_size (float, optional): Stepsize for rt calibration. Defaults to 0.1.
    plot_ppms (bool, optional): Flag to indicate plotting. Defaults to False.
    ms_level (int, optional): ms_level for calibration. Defaults to 2.

Raises:
    ValueError: When ms_level is not valid.

Returns:
    np.ndarray: Estimated errors.

---

[source](https://github.com/mannlabs/alphapept/blob/master/alphapept/recalibration.py#L713){target="_blank" style="float:right; font-size:smaller"}

### calibrate_fragments

>      calibrate_fragments (db_file_name:str, ms_data_file_name:str,
>                           ms_level:int=2, write=True, plot_ppms=False)

Wrapper function to calibrate fragments.
Calibrated values are saved to corrected_fragment_mzs.

Args:
    db_file_name (str): Path to the database.
    ms_data_file_name (str): Path to the ms_data file.
    ms_level (int, optional): MS level for calibration. Defaults to 2.
    write (bool, optional): Boolean flag for test purposes to avoid writing to the test file. Defaults to True.
    plot_ppms (bool, optional): Boolean flag to plot the calibration. Defaults to False.