# Transformer Model Data Preparation Workflow

This notebook documents the initial pipeline for preparing and preprocessing mass spectrometry data, which will later be fed to our Transformer model.

We start by loading and processing MGF files, applying a series of filters and transformations to ensure the quality and appropriate format of the data.

The first step validates the MGF file path (**path_check**) and loads the raw spectra. The **mgf_get_spectra** function reads the MGF file and extracts the data for each spectrum; here `num_spectra=5` limits loading to five spectra.

In [1]:
from src.utils import *
from src.config import *
from src.mgf_tools.mgf_get import * 

In [2]:
mgf_data = r"/Users/carla/PycharmProjects/Mestrado/Transformer-Based-Models-for-Chemical-Fingerprint-Prediction/datasets/raw/cleaned_gnps_library.mgf"

path_check(mgf_data)

File found!


In [3]:
mgf_spect = mgf_get_spectra(mgf_data, num_spectra=5)
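
For readers unfamiliar with the MGF layout, the sketch below shows one way such a file can be read using only the standard library. It is not the implementation of **mgf_get_spectra**: the function name `iter_mgf_spectra`, its return shape, and the sample text are assumptions made for illustration. The only facts relied on are that MGF spectra are delimited by `BEGIN IONS` / `END IONS`, carry `KEY=VALUE` headers, and list peaks as whitespace-separated m/z and intensity pairs.

```python
from typing import Dict, Iterable, Iterator, List, Optional, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)


def iter_mgf_spectra(
    lines: Iterable[str], num_spectra: Optional[int] = None
) -> Iterator[Tuple[Dict[str, str], List[Peak]]]:
    """Hypothetical minimal MGF reader: yield (metadata, peaks) per spectrum."""
    metadata: Dict[str, str] = {}
    peaks: List[Peak] = []
    in_block = False
    yielded = 0
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line == "BEGIN IONS":
            # Start a fresh spectrum.
            metadata, peaks, in_block = {}, [], True
        elif line == "END IONS":
            if in_block:
                yield metadata, peaks
                yielded += 1
                in_block = False
                # Stop early once the requested number of spectra is reached.
                if num_spectra is not None and yielded >= num_spectra:
                    return
        elif in_block:
            if "=" in line and not line[0].isdigit():
                # Header line such as TITLE=... or PEPMASS=...
                key, value = line.split("=", 1)
                metadata[key.strip().upper()] = value.strip()
            else:
                # Peak line: m/z followed by intensity.
                parts = line.split()
                if len(parts) >= 2:
                    peaks.append((float(parts[0]), float(parts[1])))


# Made-up sample text, used only to exercise the sketch.
sample = """BEGIN IONS
TITLE=demo
PEPMASS=301.1
100.0 10.0
150.5 250.0
END IONS
BEGIN IONS
TITLE=second
PEPMASS=450.2
200.0 5.0
END IONS
""".splitlines()

spectra = list(iter_mgf_spectra(sample, num_spectra=1))
```

With `num_spectra=1`, only the first block is returned, which mirrors how the `num_spectra` argument is used in the cell above.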

This is the core phase of transforming the raw data. The **mgf_deconvoluter** function iterates over the loaded spectra and passes each one to **mgf_spectrum_deconvoluter**, which applies a series of cleaning and tokenization steps.

In [4]:
x = mgf_deconvoluter(
    mgf_data=mgf_spect,
    mz_vocabs=mz_vocabs,
    min_num_peaks=5,
    max_num_peaks=max_num_peaks,
    noise_rmv_threshold=0.01,
    mass_error=0.01,
    log=True,
)

[2] Rejected spectrum: 500
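
To make the cleaning and tokenization steps more concrete, here is a rough sketch of what a per-spectrum routine of this kind might do. It is an illustration under stated assumptions, not the code of **mgf_spectrum_deconvoluter**: the function `clean_and_tokenize` is hypothetical, the noise threshold is assumed to be relative to the most intense peak, and tokens are produced by simple fixed-width binning (`bin_width`) rather than by looking values up in `mz_vocabs` as the real pipeline appears to do.

```python
from typing import List, Optional, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)


def clean_and_tokenize(
    peaks: List[Peak],
    min_num_peaks: int = 5,
    max_num_peaks: int = 100,
    noise_rmv_threshold: float = 0.01,
    bin_width: float = 0.01,  # hypothetical parameter, not from the notebook
) -> Optional[Tuple[List[int], List[float]]]:
    """Hypothetical sketch: return (mz_tokens, intensities) or None if rejected."""
    if not peaks:
        return None
    base_peak = max(intensity for _, intensity in peaks)
    if base_peak <= 0:
        return None

    # Assumed noise rule: drop peaks below a fraction of the base peak.
    kept = [(mz, i) for mz, i in peaks if i >= noise_rmv_threshold * base_peak]

    # Reject spectra that are too sparse after noise removal.
    if len(kept) < min_num_peaks:
        return None

    # Keep the most intense peaks, then restore ascending m/z order.
    kept = sorted(kept, key=lambda p: p[1], reverse=True)[:max_num_peaks]
    kept.sort(key=lambda p: p[0])

    # Assumed tokenization: integer bin index of each m/z value.
    mz_tokens = [int(round(mz / bin_width)) for mz, _ in kept]
    # Scale intensities relative to the base peak.
    intensities = [i / base_peak for _, i in kept]
    return mz_tokens, intensities


# Made-up peaks, used only to exercise the sketch.
demo_peaks = [(100.2, 1000.0), (150.4, 500.0), (200.6, 5.0), (250.8, 20.0)]
accepted = clean_and_tokenize(demo_peaks, min_num_peaks=3, bin_width=1.0)
rejected = clean_and_tokenize(demo_peaks, min_num_peaks=5, bin_width=1.0)
```

In this sketch a spectrum that fails a filter yields `None`, which is one plausible way for the outer loop to decide what to log as rejected and what to keep.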


The **mgf_deconvoluter** function returns a list of tuples, where each tuple (spectrum_id, tokenized_mz, tokenized_precursor, intensities) represents a spectrum that has successfully passed through the entire preprocessing pipeline.

In [15]:
# Index of the processed spectrum to inspect.
idx = 2

# Check the index itself, so a short result list cannot raise an IndexError.
if len(x) > idx:
    spectrum_id, tokenized_mz, tokenized_precursor, intensities = x[idx]

    print("\nTokenised spectrum details:")
    print(f"Spectrum ID: {spectrum_id}")
    print(f"Number of m/z tokens: {len(tokenized_mz)}")
    print(f"Number of intensities: {len(intensities)}")
    print(f"Precursor token: {tokenized_precursor}")
else:
    print(f"Only {len(x)} spectra passed the filters; cannot show index {idx}.")


Tokenised spectrum details:
Spectrum ID: CCMSLIB00000001555
Number of m/z tokens: 15
Number of intensities: 15
Precursor token: 6641
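
The output above shows equal numbers of m/z tokens and intensities for one spectrum. A small check like the following could confirm that this holds for every returned tuple. The `TokenizedSpectrum` type and `find_length_mismatches` helper are hypothetical names introduced here, the field types (integer tokens, float intensities) are assumptions, and the sample values are placeholders rather than real results.

```python
from typing import List, NamedTuple, Sequence, Tuple


class TokenizedSpectrum(NamedTuple):
    """Hypothetical named view of one result tuple."""
    spectrum_id: str
    tokenized_mz: List[int]
    tokenized_precursor: int
    intensities: List[float]


def find_length_mismatches(results: Sequence[Tuple]) -> List[str]:
    """Return the IDs of spectra whose token and intensity counts differ."""
    mismatched = []
    for item in results:
        spectrum = TokenizedSpectrum(*item)
        if len(spectrum.tokenized_mz) != len(spectrum.intensities):
            mismatched.append(spectrum.spectrum_id)
    return mismatched


# Placeholder tuples in the documented (id, mz tokens, precursor, intensities) order.
demo_results = [
    ("demo_a", [10, 20, 30], 7, [0.1, 0.5, 1.0]),
    ("demo_b", [10, 20], 9, [1.0]),
]
bad_ids = find_length_mismatches(demo_results)
```

On the real data, calling `find_length_mismatches(x)` would be expected to return an empty list if every spectrum is aligned the same way as the one printed above.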
