# Metabolomics drift correction


Within-batch correction of metabolomics data.
This notebook shows how to correct for instrumental drift using pooled QC samples.

### Set up the environment

In [1]:
#%pip install acore

In [2]:
import acore
from acore import filter_metabolomics as fm

import pandas as pd
import os
import importlib

importlib.reload(acore)

<module 'acore' from '/Users/fcasc/Documents/03_CORE/acore/src/acore/__init__.py'>

In [3]:
from acore import drift_correction as dc

In [4]:
help(dc.loess_drift_correction)

Help on module acore.drift_correction.loess_drift_correction in acore.drift_correction:

NAME
    acore.drift_correction.loess_drift_correction - Functions for metabolomics drift correction.

FUNCTIONS
    filter_features_by_qc(df: pandas.core.frame.DataFrame, qc_cols: list, threshold: float = 0.5) -> pandas.core.frame.DataFrame
        Filter features in a DataFrame based on quality control (QC) completeness.
        
        This function removes rows (features) that do not meet a minimum number of valid
        (non-missing) QC values. The minimum number of required valid values is computed as
        `ceil(n_qc * (1 - threshold))`, where `n_qc` is the number of QC columns.
        
        :param pandas.DataFrame df:
            Input DataFrame containing feature data and QC columns.
        :param list qc_cols:
            List of column names corresponding to QC measurements.
        :param float threshold:
            Fraction (between 0 and 1) indicating the maximum allowed pro
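
The completeness rule quoted in the docstring above can be worked through by hand. The sketch below is not acore's implementation; it only applies the stated formula `ceil(n_qc * (1 - threshold))` to a toy table. The helper names `min_valid_qc` and `keep_complete_features` and the toy values are hypothetical.

```python
import math

import numpy as np
import pandas as pd


def min_valid_qc(n_qc: int, threshold: float) -> int:
    """Minimum number of non-missing QC values, following the docstring formula."""
    return math.ceil(n_qc * (1 - threshold))


def keep_complete_features(df: pd.DataFrame, qc_cols: list, threshold: float = 0.5) -> pd.DataFrame:
    """Keep rows with enough non-missing QC values (illustrative stand-in only)."""
    n_valid = df[qc_cols].notna().sum(axis=1)
    return df[n_valid >= min_valid_qc(len(qc_cols), threshold)]


# Toy table: three features measured in four QC injections.
toy = pd.DataFrame(
    {
        "QC1": [0.10, np.nan, 0.30],
        "QC2": [0.11, np.nan, np.nan],
        "QC3": [0.12, np.nan, 0.31],
        "QC4": [0.09, 0.20, np.nan],
    }
)
kept = keep_complete_features(toy, ["QC1", "QC2", "QC3", "QC4"], threshold=0.5)

print(min_valid_qc(6, 0.5))   # six QC columns, as in this notebook
print(kept.index.tolist())    # indices of the features that survive the filter
```

With six QC columns and a threshold of 0.5, the formula requires at least three valid QC values per feature.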

In [5]:
help(dc.loess_drift_correction.run_drift_correction)

Help on function run_drift_correction in module acore.drift_correction.loess_drift_correction:

run_drift_correction(data, qc_cols, sample_cols, sample_order: pandas.core.frame.DataFrame, feature_name_col: str = None, filter_percent: float = None, qc_min_threshold: int = 4, print_logs=False, use_default=False)
    Perform QC-based drift correction across multiple features using
    LOESS regression and spline interpolation.
    
    For each feature:
    1. Extract QC intensities and corresponding injection order.
    2. Optionally filter features based on QC completeness.
    3. Compute QC relative standard deviation (RSD).
    4. If sufficient QC points exist, estimate a drift curve using
       `qc_rlsc_loess`, finding the best alpha smoothing span
       with leave-one-out cross validation (LOOCV).
    5. Normalize all intensities by dividing by the drift curve and
       scaling to the QC median.
    6. Record drift parameters and correction metadata.
    
    Parameters
    -----
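
Steps 3 and 5 of the list above can be illustrated on a single synthetic feature. This is a sketch under stated assumptions, not the library's code: linear interpolation between QC points (`np.interp`) stands in for the LOESS and spline drift curve, and all intensities are made up.

```python
import numpy as np

# Synthetic feature: 10 injections with a steady loss of signal over the run.
order = np.arange(1, 11)
drift = np.linspace(1.0, 0.6, order.size)
intensity = 1.0 * drift

# Assume pooled QC samples were injected at positions 1, 5 and 10.
qc_mask = np.isin(order, [1, 5, 10])
x_qc, y_qc = order[qc_mask], intensity[qc_mask]

# Step 3: relative standard deviation (RSD) of the QC injections, in percent.
rsd_before = 100 * y_qc.std() / y_qc.mean()

# Stand-in drift curve: interpolate the QC intensities over all injection positions.
drift_curve = np.interp(order, x_qc, y_qc)

# Step 5: divide by the drift curve and rescale to the QC median.
corrected = intensity / drift_curve * np.median(y_qc)

rsd_after = 100 * corrected[qc_mask].std() / corrected[qc_mask].mean()
print(f"QC RSD before: {rsd_before:.1f} %, after: {rsd_after:.1f} %")
```

Because the synthetic drift is exactly linear, the interpolated curve removes it completely; real data would leave residual scatter around the QC median.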

## Load in data

In [17]:
df = pd.read_excel("../../example_data/aradopsis_seedling_lipids/kehelpannala_AnArabidopsisLipid_seedling.xlsx")
sample_order = pd.read_csv("../../example_data/aradopsis_seedling_lipids/seedling_art_sample_order.csv")

In [18]:
df

Unnamed: 0.1,Unnamed: 0,Lipid species,Adduct,m/z,RT (min),Characteristic fragments 4,Characteristic fragments 5,Characteristic fragments 6,Characteristic fragments 7,Characteristic fragments 8,...,Sd3-7,Sd3-8,Sd3-9,PBQC_Sd_1,PBQC_Sd_2,PBQC_Sd_3,PBQC_Sd_4,PBQC_Sd_5,PBQC_Sd_6,PBQC_CV
0,1,ADGGA 50:2; 16:0/16:0_18:2,[M+NH4]+,1024.797,18.141,415.272,337.273,313.254,239.236,,...,0.012,0.019,0.021,0.008,0.007,0.006,0.013,0.010,0.007,29.973
1,2,ADGGA 50:3; 16:0/16:0_18:3,[M+NH4]+,1022.778,17.097,239.240,313.276,335.261,415.268,,...,0.011,0.023,0.007,0.007,0.008,0.006,0.006,0.007,0.005,12.833
2,5,Cer-AP t34:1+O; t18:1/16:0+O,[M+H]+,570.508,9.596,298.271,262.258,280.264,534.490,552.499,...,0.404,0.381,0.320,0.126,0.114,0.128,0.110,0.106,0.111,8.022
3,6,Cer-AP t40:0+O; t18:0/22:0+O,[M+H]+,656.613,15.488,300.280,264.262,282.276,620.588,638.602,...,0.233,0.215,0.246,0.071,0.070,0.071,0.064,0.070,0.061,6.112
4,7,Cer-AP t40:1+O; t18:1/22:0+O,[M+H]+,654.607,14.716,298.272,262.252,280.267,316.283,600.560,...,2.631,2.664,2.097,0.724,0.874,0.835,0.830,0.923,0.787,8.303
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
332,363,TG 62:2; 24:0_18:1_20:1,[M+NH4]+,1016.953,23.249,717.673,689.643,631.566,,,...,0.127,0.160,0.122,0.064,0.071,0.011,0.070,0.067,0.056,40.767
333,364,TG 62:4; 26:0_18:2_18:2,[M+NH4]+,1012.935,22.827,715.647,599.497,,,,...,0.070,0.115,0.171,0.043,0.040,0.053,0.032,0.050,0.054,18.542
334,365,TG 62:5; 26:0_18:2_18:3,[M+NH4]+,1010.910,22.655,,,,,,...,0.044,0.169,0.117,0.020,0.092,0.020,0.017,0.059,0.032,75.269
335,366,TG 62:5; 26:1_18:2_18:2,[M+NH4]+,1010.908,22.580,,,,,,...,0.189,0.053,0.070,0.073,0.075,0.079,0.054,0.019,0.086,38.244


We also have an (artificial) table giving the order in which our samples were run (the `Sample ID` column). This information is crucial for the drift correction algorithm.

In [19]:
sample_order.sort_values("Sample ID")

Unnamed: 0.1,Unnamed: 0,File Name,Sample ID
24,24,PBQC_Sd_1,1
0,0,Sd1_1,2
1,1,Sd1_2,3
2,2,Sd1_3,4
3,3,Sd1_4,5
25,25,PBQC_Sd_2,6
4,4,Sd1_5,7
5,5,Sd1_6,8
6,6,Sd2_1,9
7,7,Sd2_2,10
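
To make the role of the run order concrete, the sketch below maps column names to injection positions using a toy table that copies a few of the rows shown above. It assumes, as in this example, that `File Name` matches the intensity column names and that `Sample ID` holds the injection position; how acore consumes the table internally is not shown here.

```python
import pandas as pd

# Toy copy of a few rows of the sample-order table above.
sample_order_toy = pd.DataFrame(
    {
        "File Name": ["Sd1_1", "Sd1_2", "Sd1_3", "Sd1_4", "Sd1_5", "PBQC_Sd_1", "PBQC_Sd_2"],
        "Sample ID": [2, 3, 4, 5, 7, 1, 6],
    }
)

# Series that maps each column name to its injection position.
run_position = sample_order_toy.set_index("File Name")["Sample ID"]

# Injection positions of the QC samples (the x-axis of a drift curve).
qc_cols_toy = ["PBQC_Sd_1", "PBQC_Sd_2"]
x_qc = run_position.loc[qc_cols_toy].tolist()

# Column names sorted by the order in which they were injected.
columns_in_run_order = run_position.sort_values().index.tolist()

print(x_qc)
print(columns_in_run_order)
```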


### Run drift correction with LOESS smoothing

We can now correct our data for experimental drift. 

The acore LOESS drift-correction function fits a LOESS (locally estimated scatterplot smoothing) curve to the QC intensities of each feature separately, to model slow trends over the injection order. The resulting smooth trend is then used to correct the data.
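
As a rough illustration of the idea, the sketch below hand-writes a basic LOESS variant (local linear regression with tricube weights) in NumPy and picks the span `alpha` by leave-one-out cross validation. It is not acore's `qc_rlsc_loess`: the function names `loess_predict` and `loocv_error`, the alpha grid, and the QC intensities are assumptions, and the local fit is evaluated directly at every injection position instead of being spline-interpolated.

```python
import numpy as np


def loess_predict(x, y, x_new, alpha):
    """Local linear regression with tricube weights (a basic LOESS variant)."""
    x, y, x_new = (np.asarray(a, dtype=float) for a in (x, y, x_new))
    # Number of neighbours per local fit; at least 3, because the farthest
    # neighbour receives zero weight under the tricube kernel.
    k = min(x.size, max(3, int(np.ceil(alpha * x.size))))
    fitted = np.empty(x_new.size)
    for i, x0 in enumerate(x_new):
        dist = np.abs(x - x0)
        bandwidth = np.sort(dist)[k - 1]
        if bandwidth == 0:  # guard against duplicated positions
            bandwidth = 1.0
        w = np.clip(1 - (dist / bandwidth) ** 3, 0, None) ** 3
        # Weighted least squares for a straight line centred at x0;
        # the intercept is the smoothed value at x0.
        design = np.column_stack([np.ones_like(x), x - x0])
        sw = np.sqrt(w)
        coef, *_ = np.linalg.lstsq(design * sw[:, None], y * sw, rcond=None)
        fitted[i] = coef[0]
    return fitted


def loocv_error(x, y, alpha):
    """Mean squared leave-one-out prediction error for a given span."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    errors = []
    for i in range(x.size):
        keep = np.arange(x.size) != i
        pred = loess_predict(x[keep], y[keep], [x[i]], alpha)[0]
        errors.append((y[i] - pred) ** 2)
    return float(np.mean(errors))


# Made-up QC intensities at the QC injection positions.
x_qc = np.array([1, 6, 12, 19, 25, 30])
y_qc = np.array([1.00, 0.97, 0.95, 0.90, 0.86, 0.80])

alphas = [0.6, 0.8, 1.0]  # candidate spans (an assumption)
best_alpha = min(alphas, key=lambda a: loocv_error(x_qc, y_qc, a))

# Drift curve evaluated at every injection position of the run.
drift_curve = loess_predict(x_qc, y_qc, np.arange(1, 31), best_alpha)
print(best_alpha, drift_curve[:3])
```

A larger `alpha` uses more QC points per local fit and gives a smoother curve; with only a handful of QC injections, spans close to 1 are often selected.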

Before the drift curve is estimated and the correction applied, the data can be filtered to remove features that have too many missing values in the QC samples.

In [20]:
# First, group the sample names in a dictionary to make the upcoming function call easier.

data_groups = {
    "SD1": ['Sd1_1', 'Sd1_2', 'Sd1_3', 'Sd1_4', 'Sd1_5', 'Sd1_6'],
    "SD2": ['Sd2_1', 'Sd2_2', 'Sd2_3', 'Sd2_4', 'Sd2_5', 'Sd2_6', 'Sd2_7', 'Sd2_8', 'Sd2_9'],
    "SD3": ['Sd3-1', 'Sd3-2', 'Sd3-3', 'Sd3-4', 'Sd3-5', 'Sd3-6', 'Sd3-7', 'Sd3-8', 'Sd3-9'],
    "QC": ['PBQC_Sd_1', 'PBQC_Sd_2', 'PBQC_Sd_3', 'PBQC_Sd_4', 'PBQC_Sd_5', 'PBQC_Sd_6'],
}

In [21]:
# Separate groups
sample_cols = data_groups["SD1"] + data_groups["SD2"] + data_groups["SD3"]
qc_cols = data_groups["QC"]

# Create a sub-DataFrame with only the intensity columns, omitting metadata.
# Note: the call below passes the full `df`, so the metadata columns are carried through to the output.
df_dc = df[sample_cols + qc_cols]
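
Because the group lists are typed by hand, a quick check that every listed name exists as a column can catch typos early (note, for instance, that the `Sd3` columns use hyphens while the others use underscores). The helper name `missing_columns` and the toy frame below are illustrative only.

```python
import pandas as pd


def missing_columns(df: pd.DataFrame, expected: list) -> list:
    """Return the expected column names that are absent from `df`, in order."""
    present = set(df.columns)
    return [name for name in expected if name not in present]


# Toy frame with one column per naming style used in this dataset.
toy = pd.DataFrame({"Sd1_1": [0.1], "Sd3-1": [0.2], "PBQC_Sd_1": [0.3]})

print(missing_columns(toy, ["Sd1_1", "Sd3-1", "PBQC_Sd_1"]))  # correct names
print(missing_columns(toy, ["Sd1_1", "Sd3_1", "PBQC_Sd_1"]))  # underscore typo in the Sd3 name
```

In this notebook the same check could be run as `missing_columns(df, sample_cols + qc_cols)` before calling the correction.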

Now we can run the drift correction.

In [22]:
corrected_df, correction_info = dc.run_drift_correction(
    df, 
    qc_cols, 
    sample_cols, 
    sample_order=sample_order, 
    feature_name_col=None,
    filter_percent=0.5, 
    print_logs=True
)

### Explanation of the parameters chosen
# - feature_name_col: the name of the column containing feature names, if there is one.
#   It is only used for logging and for labelling outputs; the method works without it.
#   Here, no feature name column is specified, so None is used.
# - filter_percent: the minimum fraction (between 0 and 1) of QC values that must be present for a feature to be retained.
#   If the fraction of non-missing values is below this, the feature is filtered out.
#   If this parameter is set to None, no filtering is done.
# - print_logs: whether the function prints its log messages (similar to a verbose flag).

Corrected 0 with alpha 0.6.
Corrected 1 with alpha 0.9999999999999999.
Corrected 2 with alpha 0.9999999999999999.
Corrected 3 with alpha 0.9999999999999999.
Corrected 4 with alpha 0.9999999999999999.
Corrected 5 with alpha 0.9999999999999999.
Corrected 6 with alpha 0.6.
Corrected 7 with alpha 0.9999999999999999.
Corrected 8 with alpha 0.9999999999999999.
Corrected 9 with alpha 0.9999999999999999.
Corrected 10 with alpha 0.7999999999999999.
Corrected 11 with alpha 0.9999999999999999.
Corrected 12 with alpha 0.9999999999999999.
Flagging feature 13 due to too high QC RSD. (Feature not skipped)
RSD: 42.131018920330554
Corrected 13 with alpha 0.9999999999999999.
Flagging feature 14 due to too high QC RSD. (Feature not skipped)
RSD: 81.76681618000576
Corrected 14 with alpha 0.9999999999999999.
Corrected 15 with alpha 0.7999999999999999.
Corrected 16 with alpha 0.9999999999999999.
Corrected 17 with alpha 0.6.
Flagging feature 18 due to too high QC RSD. (Feature not skipped)
RSD: 78.2238466951

Now we can inspect our results. First, the corrected output dataframe.

In [23]:
corrected_df

Unnamed: 0.1,Unnamed: 0,Lipid species,Adduct,m/z,RT (min),Characteristic fragments 4,Characteristic fragments 5,Characteristic fragments 6,Characteristic fragments 7,Characteristic fragments 8,...,Sd3-8,Sd3-9,PBQC_Sd_1,PBQC_Sd_2,PBQC_Sd_3,PBQC_Sd_4,PBQC_Sd_5,PBQC_Sd_6,PBQC_CV,TempName
0,1,ADGGA 50:2; 16:0/16:0_18:2,[M+NH4]+,1024.797,18.141,415.272,337.273,313.254,239.236,,...,0.018,0.022,0.007,0.007,0.007,0.007,0.007,0.007,29.973,0
1,2,ADGGA 50:3; 16:0/16:0_18:3,[M+NH4]+,1022.778,17.097,239.240,313.276,335.261,415.268,,...,0.026,0.008,0.006,0.007,0.006,0.006,0.008,0.006,12.833,1
2,5,Cer-AP t34:1+O; t18:1/16:0+O,[M+H]+,570.508,9.596,298.271,262.258,280.264,534.490,552.499,...,0.398,0.336,0.114,0.106,0.122,0.108,0.108,0.117,8.022,2
3,6,Cer-AP t40:0+O; t18:0/22:0+O,[M+H]+,656.613,15.488,300.280,264.262,282.276,620.588,638.602,...,0.234,0.269,0.070,0.069,0.072,0.066,0.075,0.067,6.112,3
4,7,Cer-AP t40:1+O; t18:1/22:0+O,[M+H]+,654.607,14.716,298.272,262.252,280.267,316.283,600.560,...,2.647,2.086,0.779,0.907,0.831,0.811,0.911,0.784,8.303,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
332,363,TG 62:2; 24:0_18:1_20:1,[M+NH4]+,1016.953,23.249,717.673,689.643,631.566,,,...,0.163,0.122,0.068,0.084,0.014,0.088,0.073,0.056,40.767,332
333,364,TG 62:4; 26:0_18:2_18:2,[M+NH4]+,1012.935,22.827,715.647,599.497,,,,...,0.107,0.157,0.045,0.043,0.057,0.033,0.048,0.049,18.542,333
334,365,TG 62:5; 26:0_18:2_18:3,[M+NH4]+,1010.910,22.655,,,,,,...,0.115,0.078,0.011,0.055,0.013,0.012,0.042,0.021,75.269,334
335,366,TG 62:5; 26:1_18:2_18:2,[M+NH4]+,1010.908,22.580,,,,,,...,0.067,0.089,0.068,0.075,0.088,0.069,0.025,0.110,38.244,335


For comparison, here is the original, uncorrected dataframe again.

In [24]:
df

Unnamed: 0.1,Unnamed: 0,Lipid species,Adduct,m/z,RT (min),Characteristic fragments 4,Characteristic fragments 5,Characteristic fragments 6,Characteristic fragments 7,Characteristic fragments 8,...,Sd3-7,Sd3-8,Sd3-9,PBQC_Sd_1,PBQC_Sd_2,PBQC_Sd_3,PBQC_Sd_4,PBQC_Sd_5,PBQC_Sd_6,PBQC_CV
0,1,ADGGA 50:2; 16:0/16:0_18:2,[M+NH4]+,1024.797,18.141,415.272,337.273,313.254,239.236,,...,0.012,0.019,0.021,0.008,0.007,0.006,0.013,0.010,0.007,29.973
1,2,ADGGA 50:3; 16:0/16:0_18:3,[M+NH4]+,1022.778,17.097,239.240,313.276,335.261,415.268,,...,0.011,0.023,0.007,0.007,0.008,0.006,0.006,0.007,0.005,12.833
2,5,Cer-AP t34:1+O; t18:1/16:0+O,[M+H]+,570.508,9.596,298.271,262.258,280.264,534.490,552.499,...,0.404,0.381,0.320,0.126,0.114,0.128,0.110,0.106,0.111,8.022
3,6,Cer-AP t40:0+O; t18:0/22:0+O,[M+H]+,656.613,15.488,300.280,264.262,282.276,620.588,638.602,...,0.233,0.215,0.246,0.071,0.070,0.071,0.064,0.070,0.061,6.112
4,7,Cer-AP t40:1+O; t18:1/22:0+O,[M+H]+,654.607,14.716,298.272,262.252,280.267,316.283,600.560,...,2.631,2.664,2.097,0.724,0.874,0.835,0.830,0.923,0.787,8.303
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
332,363,TG 62:2; 24:0_18:1_20:1,[M+NH4]+,1016.953,23.249,717.673,689.643,631.566,,,...,0.127,0.160,0.122,0.064,0.071,0.011,0.070,0.067,0.056,40.767
333,364,TG 62:4; 26:0_18:2_18:2,[M+NH4]+,1012.935,22.827,715.647,599.497,,,,...,0.070,0.115,0.171,0.043,0.040,0.053,0.032,0.050,0.054,18.542
334,365,TG 62:5; 26:0_18:2_18:3,[M+NH4]+,1010.910,22.655,,,,,,...,0.044,0.169,0.117,0.020,0.092,0.020,0.017,0.059,0.032,75.269
335,366,TG 62:5; 26:1_18:2_18:2,[M+NH4]+,1010.908,22.580,,,,,,...,0.189,0.053,0.070,0.073,0.075,0.079,0.054,0.019,0.086,38.244


We can also look into the `correction_info` object to see the parameters that were chosen for each feature.

For example, let's check the parameters used for the feature with index 200.

In [25]:
correction_info[200]

{'alpha': np.float64(0.9999999999999999),
 'drift_curve': [0.11006544927415765,
  0.10902530439341297,
  0.10803676752624497,
  0.10708091749354393,
  0.10519159321510337,
  0.10422027661114426,
  0.10320596212521285,
  0.10212972857819934,
  0.10097265479099393,
  0.09833775289421522,
  0.09680678911430461,
  0.09508871375352752,
  0.09314931232065644,
  0.09095437032446389,
  0.08846967327372238,
  0.08250944644837123,
  0.07905723011943926,
  0.07536188562731365,
  0.07148094090889955,
  0.06747192390110208,
  0.05929978476497768,
  0.05525171851046103,
  0.05130569171418163,
  0.04751923231304461,
  0.1111761233475889,
  0.1061388331162,
  0.0997158195844868,
  0.08566100667720437,
  0.0633923625408264,
  0.04394986824395513],
 'y_qc': [0.1097, 0.10574, 0.105218, 0.086961, 0.08665, 0.026189],
 'x_qc': [1, 6, 12, 19, 25, 30],
 'rsd_qc': np.float64(32.931697771711335),
 'median': np.float64(0.0960895),
 'y_all': array([0.78902715, 1.1869441 , 0.86874449, 0.62837778, 0.8779646 ,
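
The `rsd_qc` and `median` entries above can be checked by hand from `y_qc`. In this output they are consistent with the population standard deviation (`ddof=0`) divided by the mean, times 100, and with the plain median of the QC intensities. This is inferred from the numbers shown here, not from the library source.

```python
import numpy as np

# QC intensities copied from the 'y_qc' entry of correction_info[200].
y_qc = np.array([0.1097, 0.10574, 0.105218, 0.086961, 0.08665, 0.026189])

# Relative standard deviation in percent, using the population standard deviation.
rsd_qc = 100 * y_qc.std(ddof=0) / y_qc.mean()

# Median QC intensity, the level the corrected data are scaled to.
median_qc = np.median(y_qc)

print(f"RSD: {rsd_qc:.2f} %")
print(f"median: {median_qc:.7f}")
```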
     