# DIA-Aspire-Rescore: Complete Workflow

This notebook demonstrates the end-to-end workflow for rescoring DIA-NN peptide-spectrum matches using deep learning models.

## Workflow Overview

1. **Data Loading** - Read PSM data from DIA-NN output
2. **Model Finetuning** - Adapt pretrained MS2 and RT models to the dataset (MS2 matching of theoretical fragments against experimental spectra is handled internally)
3. **Feature Generation** - Calculate MS2 similarity and RT prediction features
4. **Feature Evaluation & Reporting** - Evaluate feature quality with target-decoy analysis


## Setup


In [None]:
import warnings
warnings.filterwarnings("ignore")

from pathlib import Path
import matplotlib.pyplot as plt
import pandas as pd
from alpharaw import register_all_readers
from peptdeep.rescore.fdr import calc_fdr

from dia_aspire_rescore.io import read_diann2
from dia_aspire_rescore.config import FineTuneConfig
from dia_aspire_rescore.finetuning import FineTuner
from dia_aspire_rescore.features import MS2FeatureGenerator, RTFeatureGenerator
from dia_aspire_rescore.plot import plot_target_decoy_dist, plot_qvalues

register_all_readers()

output_dir = Path('../output/step_by_step')
output_dir.mkdir(parents=True, exist_ok=True)

RAW_FILE = '20200317_QE_HFX2_LC3_DIA_RA957_R01'
MS_FILE = f'../output/{RAW_FILE}.mzML.hdf5'


## 1. Data Loading

Load PSM (Peptide-Spectrum Match) data from DIA-NN output. The data includes target and decoy PSMs for FDR control.


In [None]:
psm_df_all = read_diann2("../../data/raw/SYS026_RA957/DDA_SYSMHC_bynam/lib-base-result-first-pass.parquet")
psm_df_all = psm_df_all[psm_df_all['raw_name'] == RAW_FILE].copy()
psm_df_all = psm_df_all.sort_values(by='nAA', ascending=True).reset_index(drop=True)

n_target = (psm_df_all['decoy'] == 0).sum()
n_decoy = (psm_df_all['decoy'] == 1).sum()
print(f"Loaded {len(psm_df_all)} PSMs from {RAW_FILE}")
print(f"Target: {n_target}, Decoy: {n_decoy}")


## 2. Model Finetuning

Finetune pretrained MS2 and RT models on high-confidence PSMs (FDR < 1%). The `train()` method handles FDR filtering, MS2 matching, and training internally.


In [None]:
config = FineTuneConfig(
    fdr_threshold=0.01,
    instrument='QE',
    nce=27,
    psm_num_to_train_ms2=8000,
    epoch_to_train_ms2=20,
    epoch_to_train_rt_ccs=25,
    train_verbose=True
)

finetuner = FineTuner(config)
finetuner.load_pretrained('generic')

# train() handles: FDR filtering -> MS2 matching -> train MS2 -> train RT
finetuner.train(psm_df_all, {RAW_FILE: MS_FILE}, ms_file_type='hdf5')
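
The FDR filtering step mentioned above relies on the target-decoy approach. The cell below is a minimal sketch of how q-values can be estimated from a score column and a `decoy` flag, then used to keep high-confidence targets. It is not the code that `FineTuner.train()` or `calc_fdr` runs internally; the function name `target_decoy_qvalues`, the `score` column, and the toy data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def target_decoy_qvalues(df, score_column, decoy_column="decoy"):
    """Return df sorted by descending score with an added 'q_value' column.

    Hypothetical helper for illustration only; higher scores are assumed
    to be better.
    """
    ranked = df.sort_values(score_column, ascending=False).reset_index(drop=True)
    n_decoy = (ranked[decoy_column] == 1).cumsum().to_numpy()
    n_target = (ranked[decoy_column] == 0).cumsum().to_numpy()
    # Raw FDR estimate at each cutoff: decoys accepted / targets accepted.
    fdr = n_decoy / np.maximum(n_target, 1)
    # q-value: the smallest FDR at which this PSM would still be accepted,
    # i.e. a running minimum taken from the worst score upwards.
    ranked["q_value"] = np.minimum.accumulate(fdr[::-1])[::-1]
    return ranked


# Toy data: six PSMs, two of them decoys.
toy = pd.DataFrame({
    "score": [0.9, 0.8, 0.7, 0.6, 0.5, 0.4],
    "decoy": [0, 0, 1, 0, 0, 1],
})
ranked = target_decoy_qvalues(toy, "score")

# Keep only targets below the chosen q-value cutoff (0.25 here; the
# notebook's FineTuneConfig uses fdr_threshold=0.01).
confident = ranked[(ranked["decoy"] == 0) & (ranked["q_value"] <= 0.25)]
print(confident)
```

With real data the cutoff would be the `fdr_threshold` from the config, and the score would be whatever DIA-NN score column the loader exposes.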

## 3. Feature Generation

### MS2 Features

Generate MS2 similarity features. The `MS2FeatureGenerator` handles MS2 matching internally.


In [None]:
ms2_generator = MS2FeatureGenerator(
    model_mgr=finetuner.model_manager,
    ms_files={RAW_FILE: MS_FILE},
    ms_file_type='hdf5',
)

psm_df_with_features = ms2_generator.generate(psm_df_all)
print(f"Generated {len(ms2_generator.feature_names)} MS2 features")
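
To make "MS2 similarity" concrete, the sketch below compares a predicted fragment-intensity vector with a matched experimental one using cosine similarity and the normalized spectral angle. These are common similarity metrics, shown here only as examples; which features `MS2FeatureGenerator` actually computes is not shown in this notebook, and the function names and intensity values are assumptions.

```python
import numpy as np


def cosine_similarity(predicted, observed):
    """Cosine of the angle between two intensity vectors (0.0 if either is empty)."""
    predicted = np.asarray(predicted, dtype=float)
    observed = np.asarray(observed, dtype=float)
    norm = np.linalg.norm(predicted) * np.linalg.norm(observed)
    if norm == 0.0:
        return 0.0
    return float(np.dot(predicted, observed) / norm)


def spectral_angle(predicted, observed):
    """Normalized spectral angle: 1 for identical direction, 0 for orthogonal vectors."""
    # Clip guards against rounding slightly outside [-1, 1] before arccos.
    cos = np.clip(cosine_similarity(predicted, observed), -1.0, 1.0)
    return float(1.0 - 2.0 * np.arccos(cos) / np.pi)


# Hypothetical intensities for four fragment ions of one peptide.
predicted = [1.0, 0.5, 0.0, 0.2]
good_match = [0.9, 0.6, 0.0, 0.1]   # resembles the prediction
poor_match = [0.0, 0.1, 1.0, 0.0]   # dominated by an unpredicted fragment

print(spectral_angle(predicted, good_match))
print(spectral_angle(predicted, poor_match))
```

A correct identification is expected to score closer to 1 than a decoy or wrong match, which is what makes such metrics useful as rescoring features.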

### RT Features

Generate RT prediction features from the finetuned RT model.


In [None]:
rt_generator = RTFeatureGenerator(model_mgr=finetuner.model_manager)
psm_df_with_features = rt_generator.generate(psm_df_with_features)
print(f"Generated {len(rt_generator.feature_names)} RT features: {rt_generator.feature_names}")
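
As a rough illustration of an RT prediction feature, the sketch below derives a signed and an absolute retention-time deviation from observed and predicted values. It assumes that `abs_rt_delta`, which is plotted later in this notebook, is the absolute difference between observed and predicted RT; the column names `rt_observed` and `rt_predicted` and the values are hypothetical, not the columns produced by `RTFeatureGenerator`.

```python
import pandas as pd

# Hypothetical PSM table with measured and model-predicted retention times.
psms = pd.DataFrame({
    "rt_observed": [10.2, 35.0, 61.5],
    "rt_predicted": [10.0, 36.5, 48.0],
    "decoy": [0, 0, 1],
})

# Signed deviation keeps the direction of the error.
psms["rt_delta"] = psms["rt_observed"] - psms["rt_predicted"]
# Absolute deviation: smaller values mean better agreement with the model.
psms["abs_rt_delta"] = psms["rt_delta"].abs()

print(psms)
```

Unlike a similarity score, a deviation of this kind is better when it is smaller, which matters whenever it is used to rank PSMs.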

## 4. Feature Evaluation & Reporting

### Feature Statistics


In [None]:
psm_df_with_features[ms2_generator.feature_names].describe(percentiles=[0.01, 0.1, 0.5, 0.9, 0.99]).T

### RT Feature Distribution


In [None]:
psm_df_with_features[rt_generator.feature_names].describe(percentiles=[.25, .5, .75, .9, .95, .99])

### Target-Decoy Analysis

Evaluate feature quality by examining target-decoy separation.


In [None]:
plot_target_decoy_dist(psm_df_with_features, metric='spc')

In [None]:
plot_target_decoy_dist(psm_df_with_features, metric='abs_rt_delta')
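
Target-decoy separation can also be summarized as a single number alongside the distribution plots. The sketch below computes the probability that a randomly chosen target scores higher than a randomly chosen decoy (the ROC AUC). This is an illustrative metric, not part of `dia_aspire_rescore`; the function name `target_decoy_auc` is an assumption, and the pairwise comparison is only practical for small samples.

```python
import numpy as np


def target_decoy_auc(scores, is_decoy):
    """Fraction of (target, decoy) pairs in which the target has the higher score.

    1.0 means perfect separation, 0.5 means no separation. Ties count as half.
    Memory grows with n_targets * n_decoys, so subsample large tables first.
    """
    scores = np.asarray(scores, dtype=float)
    is_decoy = np.asarray(is_decoy).astype(bool)
    targets = scores[~is_decoy]
    decoys = scores[is_decoy]
    if targets.size == 0 or decoys.size == 0:
        raise ValueError("need at least one target and one decoy")
    # Compare every target with every decoy.
    diff = targets[:, None] - decoys[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / diff.size)


# Toy example: one of the four target/decoy pairs is ranked the wrong way round.
auc = target_decoy_auc([0.9, 0.4, 0.6, 0.2], [0, 0, 1, 1])
print(auc)
```

For features where lower values are better, such as an absolute RT deviation, the scores would need to be negated before calling the function.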

In [None]:
for feature in ms2_generator.feature_names + rt_generator.feature_names:
    # Estimate FDR using this single feature as the ranking score
    psm_df_eval = calc_fdr(psm_df_with_features, score_column=feature)

    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    plot_target_decoy_dist(psm_df_eval, feature, ax=axes[0])
    axes[0].set_title(f'{feature} - Target/Decoy Distribution')

    # Fall back to a looser threshold when no PSM reaches FDR <= 10%
    threshold = 0.1
    if psm_df_eval['fdr'].min() > threshold:
        threshold = 0.5
    plot_qvalues(psm_df_eval['fdr'], threshold=threshold, ax=axes[1])
    axes[1].set_title(f'{feature} - Discoveries at FDR')
    fig.tight_layout()

    # Save one PDF per feature, then close the figure to free memory
    pdf_path = output_dir / f'{feature}.pdf'
    fig.savefig(pdf_path, bbox_inches='tight')
    plt.close(fig)


**Key Outputs:**
- Feature evaluation plots, one PDF per feature: `../output/step_by_step/<feature>.pdf`

