# MaldiAMRKit - Spectral Alignment

This notebook covers spectral alignment (warping) methods to correct for mass calibration drift.

## Import Libraries

In [1]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

from maldiamrkit import MaldiSet
from maldiamrkit.alignment import RawWarping, Warping, create_raw_input

## Load Dataset

In [2]:
data = MaldiSet.from_directory(
    "../data/",
    "../data/metadata/metadata.csv",
    aggregate_by=dict(antibiotics="Drug"),
)
X = data.X
# Binary target: susceptible (S) = 0; intermediate (I) and resistant (R) = 1
y = data.y["Drug"].map({"S": 0, "I": 1, "R": 1})

print(f"Features shape: {X.shape}")

Features shape: (29, 6000)


## Warping Methods

MaldiAMRKit supports multiple alignment methods:

- **shift**: Global median shift (fast, simple)
- **linear**: Least-squares linear transformation
- **piecewise**: Local shifts across spectrum segments (most flexible)
- **dtw**: Dynamic Time Warping (best for non-linear drift, slowest)
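
To make the idea behind the **shift** method concrete, here is a minimal standard-library sketch of a global median shift on peak lists. It is an illustration only, not MaldiAMRKit's implementation: the helper `estimate_median_shift`, the nearest-peak matching, and the `max_shift` tolerance are all assumptions made for this example.

```python
from statistics import median


def estimate_median_shift(peaks, reference_peaks, max_shift=10.0):
    """Median offset (in Da) between each peak and its nearest reference peak.

    Hypothetical helper for illustration; peaks farther than `max_shift`
    from every reference peak are ignored.
    """
    offsets = []
    for p in peaks:
        nearest = min(reference_peaks, key=lambda r: abs(r - p))
        if abs(nearest - p) <= max_shift:
            offsets.append(nearest - p)
    return median(offsets) if offsets else 0.0


reference_peaks = [2500.0, 4300.0, 6200.0, 9100.0]
drifted_peaks = [2502.0, 4302.5, 6201.5, 9102.0]

# One offset for the whole spectrum: every peak moves by the same amount.
shift = estimate_median_shift(drifted_peaks, reference_peaks)
corrected_peaks = [p + shift for p in drifted_peaks]
print(shift, corrected_peaks)
```

Because a single offset is applied everywhere, peaks whose individual drift differs from the median (here the second and third) remain slightly misaligned, which is the limitation the more flexible methods address.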

In [3]:
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Shift method (fastest)
pipe_shift = Pipeline(
    [
        ("warp", Warping(method="shift")),
        ("scaler", StandardScaler()),
        ("clf", LogisticRegression()),
    ]
)

scores = cross_val_score(pipe_shift, X, y, cv=cv, scoring="roc_auc")
print(f"Shift - CV ROC AUC: {scores.mean():.3f} +/- {scores.std():.3f}")

Shift - CV ROC AUC: 0.400 +/- 0.255


In [4]:
# Linear method
pipe_linear = Pipeline(
    [
        ("warp", Warping(method="linear")),
        ("scaler", StandardScaler()),
        ("clf", LogisticRegression()),
    ]
)

scores = cross_val_score(pipe_linear, X, y, cv=cv, scoring="roc_auc")
print(f"Linear - CV ROC AUC: {scores.mean():.3f} +/- {scores.std():.3f}")

Linear - CV ROC AUC: 0.400 +/- 0.289
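
As a rough picture of what a least-squares linear transformation does, the sketch below fits `reference ≈ a * mz + b` to already-matched peak pairs using the closed-form ordinary least-squares solution. The function `fit_linear_map` and the synthetic peak values are hypothetical; how `Warping(method="linear")` pairs peaks and fits internally is not shown in this notebook.

```python
def fit_linear_map(peaks, reference_peaks):
    """Ordinary least squares for reference ≈ a * peak + b.

    Assumes peaks[i] and reference_peaks[i] are already matched pairs.
    """
    n = len(peaks)
    mean_x = sum(peaks) / n
    mean_y = sum(reference_peaks) / n
    sxx = sum((x - mean_x) ** 2 for x in peaks)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(peaks, reference_peaks))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


# Synthetic drift: a small scale error plus a constant offset.
peaks = [2000.0, 5000.0, 8000.0, 11000.0]
reference_peaks = [1.001 * p + 0.5 for p in peaks]

a, b = fit_linear_map(peaks, reference_peaks)
mapped_peaks = [a * p + b for p in peaks]
print(f"a={a:.6f}, b={b:.3f}")
```

Unlike a global shift, the slope `a` lets the correction grow with m/z, so it can absorb drift that is proportional to mass.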


In [5]:
# Piecewise method (often best trade-off)
pipe_piecewise = Pipeline(
    [
        ("warp", Warping(method="piecewise", n_segments=10)),
        ("scaler", StandardScaler()),
        ("clf", LogisticRegression()),
    ]
)

scores = cross_val_score(pipe_piecewise, X, y, cv=cv, scoring="roc_auc")
print(f"Piecewise - CV ROC AUC: {scores.mean():.3f} +/- {scores.std():.3f}")

Piecewise - CV ROC AUC: 0.400 +/- 0.289
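
The piecewise idea, local shifts across spectrum segments, can be sketched as one median offset per equal-width m/z segment. This is an assumption-laden illustration rather than the library's algorithm: `piecewise_shifts`, the equal-width segmentation, the 2000 to 20000 range, and the pre-matched peak pairs are all hypothetical, and two segments are used instead of ten to keep the example readable.

```python
from statistics import median


def piecewise_shifts(peaks, reference_peaks, mz_min, mz_max, n_segments):
    """Median offset per equal-width m/z segment (0.0 for empty segments).

    Assumes matched peak pairs with every peak inside [mz_min, mz_max].
    """
    width = (mz_max - mz_min) / n_segments
    offsets = [[] for _ in range(n_segments)]
    for p, r in zip(peaks, reference_peaks):
        # Clamp so a peak exactly at mz_max falls in the last segment.
        k = min(int((p - mz_min) / width), n_segments - 1)
        offsets[k].append(r - p)
    return [median(o) if o else 0.0 for o in offsets]


peaks = [3000.0, 5000.0, 7000.0, 12000.0, 15000.0, 18000.0]
reference_peaks = [3001.0, 5001.0, 7001.0, 11998.0, 14998.0, 17998.0]

# Low-mass peaks need +1 Da, high-mass peaks need -2 Da.
shifts = piecewise_shifts(peaks, reference_peaks, 2000.0, 20000.0, n_segments=2)
print(shifts)
```

A single global shift could not correct both halves of this spectrum at once, which is why the segment count trades flexibility against the risk of fitting noise in sparsely populated segments.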


## Alignment Quality Assessment

Use `get_alignment_quality()` to measure how well spectra were aligned to the reference.

In [6]:
# Fit warping and check alignment quality
warper = Warping(method="piecewise", n_segments=10)
warper.fit(X)
X_aligned = warper.transform(X)

# Get alignment quality metrics
quality = warper.get_alignment_quality(X, X_aligned)
print(f"Mean correlation improvement: {quality['improvement'].mean():.4f}")
quality.head()

Mean correlation improvement: 0.0056


| | correlation_before | correlation_after | improvement | rmse_before | rmse_after |
|---|---|---|---|---|---|
| 10s | 0.850781 | 0.850781 | 0.0 | 0.000137 | 0.000137 |
| 11s | 0.854397 | 0.854397 | 0.0 | 0.000185 | 0.000185 |
| 12s | 0.89836 | 0.89836 | 0.0 | 0.000192 | 0.000192 |
| 13s | 0.817404 | 0.817404 | 0.0 | 0.00024 | 0.00024 |
| 14s | 0.825112 | 0.825087 | -2.4e-05 | 0.000177 | 0.000177 |
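
The column names in the output above suggest per-spectrum Pearson correlation and RMSE against a reference, with `improvement` as the difference between the two correlations (consistent with the printed rows up to rounding). The sketch below computes such metrics for one toy spectrum. The exact definitions used by `get_alignment_quality()` are an assumption here, and `pearson`, `rmse`, and `alignment_quality` are hypothetical helpers.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation between two equal-length intensity vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def rmse(a, b):
    """Root-mean-square error between two equal-length vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def alignment_quality(before, after, reference):
    """Assumed metric definitions, mirroring the column names above."""
    corr_before = pearson(before, reference)
    corr_after = pearson(after, reference)
    return {
        "correlation_before": corr_before,
        "correlation_after": corr_after,
        "improvement": corr_after - corr_before,
        "rmse_before": rmse(before, reference),
        "rmse_after": rmse(after, reference),
    }


reference = [0.0, 1.0, 5.0, 1.0, 0.0, 0.0]
before = [0.0, 0.0, 1.0, 5.0, 1.0, 0.0]  # same peak, one bin late
after = [0.0, 1.0, 5.0, 1.0, 0.0, 0.0]   # moved back by one bin
quality_row = alignment_quality(before, after, reference)
print(quality_row)
```

Under these definitions, an `improvement` of exactly 0.0 means the aligned spectrum correlates with the reference exactly as well as the original did, as in the first four rows shown.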


## Raw Spectra Warping

`RawWarping` performs alignment at full m/z resolution (before binning) for higher precision. It loads raw spectra files during fit/transform and outputs properly binned data.

**Key workflow:**
1. Use `create_raw_input()` to build an input DataFrame of raw spectrum file paths
2. Pass this DataFrame to `RawWarping`, either directly or as a pipeline step
3. Get binned, aligned spectra as output

This design makes `RawWarping` fully compatible with sklearn pipelines.

In [7]:
# Create input DataFrame from raw spectra directory
X_raw = create_raw_input("../data/")
print(f"Input DataFrame shape: {X_raw.shape}")
print(f"Columns: {X_raw.columns.tolist()}")
X_raw.head()

Input DataFrame shape: (29, 1)
Columns: ['path']


| | path |
|---|---|
| 10s | ../data/10s.txt |
| 11s | ../data/11s.txt |
| 12s | ../data/12s.txt |
| 13s | ../data/13s.txt |
| 14s | ../data/14s.txt |


In [8]:
# RawWarping in a pipeline - outputs binned spectra
raw_warper = RawWarping(
    method="piecewise",
    bin_width=3,
    max_shift_da=10.0,
    n_segments=5,
)

# Fit and transform - loads raw files, warps at full resolution, bins output
raw_warper.fit(X_raw)
X_raw_aligned = raw_warper.transform(X_raw)
print(f"Input shape:  {X_raw.shape} (single 'path' column)")
print(f"Output shape: {X_raw_aligned.shape} (binned spectra)")
print(
    f"Output columns are m/z bin starting points: {X_raw_aligned.columns[:5].tolist()}..."
)

Input shape:  (29, 1) (single 'path' column)
Output shape: (29, 6000) (binned spectra)
Output columns are m/z bin starting points: ['2000', '2003', '2006', '2009', '2012']...
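
The column labels above follow from fixed-width binning: with `bin_width=3` and 6000 columns starting at 2000, the bins would span 2000 to 20000 Da. The sketch below shows one way such binning could work. The m/z range is inferred from the printed output, summing intensities within a bin is an assumption (the library may aggregate differently), and `bin_spectrum` is a hypothetical helper.

```python
def bin_spectrum(mz, intensity, mz_min=2000, mz_max=20000, bin_width=3):
    """Sum raw intensities into fixed-width bins labeled by their start m/z.

    Points outside [mz_min, mz_max) are dropped.
    """
    n_bins = (mz_max - mz_min) // bin_width
    sums = [0.0] * n_bins
    for m, i in zip(mz, intensity):
        if mz_min <= m < mz_max:
            sums[int((m - mz_min) // bin_width)] += i
    labels = [str(mz_min + k * bin_width) for k in range(n_bins)]
    return dict(zip(labels, sums))


# Five raw points; the last one lies exactly on the upper edge and is dropped.
mz = [2000.4, 2002.9, 2003.1, 19999.9, 20000.0]
intensity = [1.0, 2.0, 4.0, 8.0, 16.0]

binned = bin_spectrum(mz, intensity)
print(len(binned), list(binned)[:5])
```

Warping before this step matters because a drift of a few Da can push a peak into a neighboring 3 Da bin, which alignment on already-binned data can only partly undo.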


## Parallelization

Set the `n_jobs` parameter to process spectra in parallel; `n_jobs=-1` uses all available cores.

In [9]:
# Parallel warping (use all cores)
warper_parallel = Warping(method="piecewise", n_segments=10, n_jobs=-1)
warper_parallel.fit(X)
X_aligned_parallel = warper_parallel.transform(X)
print(f"Aligned {len(X)} spectra")

Aligned 29 spectra
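
Per-spectrum warping parallelizes naturally because each spectrum is transformed independently. The sketch below illustrates the pattern with the standard library. How MaldiAMRKit distributes work internally is not shown in this notebook, so `resolve_n_jobs`, `shift_spectrum`, and `transform_parallel` are hypothetical, and the handling of negative values follows the joblib convention (`-1` means all cores).

```python
import os
from concurrent.futures import ThreadPoolExecutor


def resolve_n_jobs(n_jobs):
    """Map a joblib-style n_jobs value to a worker count."""
    if n_jobs == 0:
        raise ValueError("n_jobs == 0 has no meaning")
    if n_jobs < 0:
        # -1 -> all cores, -2 -> all but one, and so on.
        return max(1, (os.cpu_count() or 1) + 1 + n_jobs)
    return n_jobs


def shift_spectrum(spectrum, shift_bins):
    """Move intensities by shift_bins positions, padding with zeros."""
    n = len(spectrum)
    out = [0.0] * n
    for i, value in enumerate(spectrum):
        j = i + shift_bins
        if 0 <= j < n:
            out[j] = value
    return out


def transform_parallel(spectra, shifts, n_jobs=1):
    """Apply each spectrum's shift independently across a worker pool."""
    with ThreadPoolExecutor(max_workers=resolve_n_jobs(n_jobs)) as pool:
        # pool.map preserves input order, so rows stay matched to samples.
        return list(pool.map(shift_spectrum, spectra, shifts))


spectra = [[1.0, 0.0, 0.0], [0.0, 0.0, 2.0]]
shifts = [1, -2]
aligned_serial = transform_parallel(spectra, shifts, n_jobs=1)
aligned_parallel = transform_parallel(spectra, shifts, n_jobs=-1)
print(aligned_parallel)
```

Threads keep the sketch short; pure-Python CPU-bound work would need processes to see a real speedup. The point is that the parallel result matches the serial one row for row.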


## RawWarping in sklearn Pipeline

Because `RawWarping` accepts a DataFrame of file paths and returns binned spectra, it can serve as the first step of an sklearn `Pipeline`.

In [10]:
# Full pipeline: raw spectra -> alignment -> scaling -> classification
pipe_raw = Pipeline(
    [
        ("warp", RawWarping(method="piecewise", bin_width=3, n_segments=5)),
        ("scaler", StandardScaler()),
        ("clf", LogisticRegression()),
    ]
)

# Cross-validation with RawWarping pipeline
# Note: X_raw contains file paths, y contains labels
scores = cross_val_score(pipe_raw, X_raw, y, cv=cv, scoring="roc_auc")
print(f"RawWarping Pipeline - CV ROC AUC: {scores.mean():.3f} +/- {scores.std():.3f}")

RawWarping Pipeline - CV ROC AUC: 0.375 +/- 0.250
