# Transformer Model Data Preparation Workflow

This notebook documents the initial pipeline for preparing and preprocessing mass spectrometry data before it is fed to our future Transformer model.

We start by loading and processing MGF files, applying a series of filters and transformations to ensure the quality and appropriate format of the data.

The first step validates the MGF file path (**path_check**) and loads the raw spectra. The **mgf_get_spectra** function reads the MGF file and extracts each spectrum's data.

In [1]:
from src.utils import *
from src.data.flexible_dataloader import *
from src.model.transformer import EncoderTransformer

In [2]:
mgf_data = r"/Users/carla/PycharmProjects/Mestrado/Transformer-Based-Models-for-Chemical-Fingerprint-Prediction/datasets/raw/cleaned_gnps_library.mgf"

path_check(mgf_data)

In [3]:
mgf_spect = mgf_get_spectra(mgf_data, num_spectra=10)
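
To make the loading step more concrete, here is a minimal sketch of how an MGF file can be read into per-spectrum records. It is not the project's **mgf_get_spectra** implementation: the function name `read_mgf_sketch`, the returned dictionary layout, and the in-memory example data are all assumptions made for illustration. It relies only on the standard MGF convention that each spectrum sits between `BEGIN IONS` and `END IONS`, with `KEY=value` header lines followed by `m/z intensity` peak lines.

```python
from typing import Dict, Iterable, List, Optional


def read_mgf_sketch(lines: Iterable[str], num_spectra: Optional[int] = None) -> List[Dict]:
    """Hypothetical MGF reader: returns one dict per spectrum with 'params' and 'peaks'."""
    spectra: List[Dict] = []
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line == "BEGIN IONS":
            # Start a new spectrum record
            current = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if current is not None:
                spectra.append(current)
                current = None
                # Stop early once the requested number of spectra is reached
                if num_spectra is not None and len(spectra) >= num_spectra:
                    break
        elif current is not None:
            if "=" in line:
                # Header line such as TITLE=... or PEPMASS=...
                key, value = line.split("=", 1)
                current["params"][key.strip().upper()] = value.strip()
            else:
                # Peak line: m/z followed by intensity
                parts = line.split()
                if len(parts) >= 2:
                    current["peaks"].append((float(parts[0]), float(parts[1])))
    return spectra


# Made-up example data, not taken from the GNPS library
example = """BEGIN IONS
TITLE=demo_1
PEPMASS=301.14
CHARGE=1+
100.0 10.0
150.5 250.0
END IONS
BEGIN IONS
TITLE=demo_2
PEPMASS=455.29
85.03 5.0
END IONS
""".splitlines()

spectra = read_mgf_sketch(example, num_spectra=1)
print(spectra[0]["params"]["TITLE"], spectra[0]["peaks"])
```

Passing `num_spectra` mirrors the argument used in the cell above and keeps quick experiments cheap on large libraries.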

This is the core step in transforming the raw data. The **mgf_deconvoluter** function iterates over each loaded spectrum and applies a series of cleaning and tokenization steps via the **mgf_spectrum_deconvoluter** function.

In [4]:
x = mgf_deconvoluter(
    mgf_data=mgf_spect,
    mz_vocabs=mz_vocabs,
    min_num_peaks=5,
    max_num_peaks=max_num_peaks,
    noise_rmv_threshold=0.01,
    mass_error=0.01,
    allowed_spectral_entropy=True,
    log=True,
)

In [5]:
print(x)

The **mgf_deconvoluter** function returns a list of tuples, where each tuple (spectrum_id, tokenized_mz, tokenized_precursor, intensities) represents a spectrum that has successfully passed through the entire preprocessing pipeline.
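
The kind of per-spectrum processing described above can be sketched as follows. This is an illustration under stated assumptions, not the actual **mgf_spectrum_deconvoluter** code: it assumes the noise threshold is applied relative to the base peak, that the most intense peaks are kept when there are too many, and that m/z values are tokenized by rounding to a fixed number of decimals. The function name `clean_and_tokenize_sketch` and the `decimals` parameter are hypothetical, and the `mass_error`, `mz_vocabs`, and spectral-entropy options are not modeled.

```python
from typing import List, Optional, Sequence, Tuple


def clean_and_tokenize_sketch(
    spectrum_id: str,
    mz: Sequence[float],
    intensities: Sequence[float],
    precursor_mz: float,
    min_num_peaks: int = 5,
    max_num_peaks: int = 100,
    noise_rmv_threshold: float = 0.01,
    decimals: int = 2,  # hypothetical tokenization resolution
) -> Optional[Tuple[str, List[int], int, List[float]]]:
    """Return (spectrum_id, tokenized_mz, tokenized_precursor, intensities) or None."""
    if not mz or len(mz) != len(intensities):
        return None
    base_peak = max(intensities)
    if base_peak <= 0:
        return None

    # Assumed noise rule: normalize to the base peak and drop weak peaks
    peaks = [
        (m, i / base_peak)
        for m, i in zip(mz, intensities)
        if i / base_peak >= noise_rmv_threshold
    ]
    if len(peaks) < min_num_peaks:
        return None  # spectrum rejected by the filter

    # Assumed truncation rule: keep the most intense peaks, then restore m/z order
    peaks = sorted(peaks, key=lambda p: p[1], reverse=True)[:max_num_peaks]
    peaks.sort(key=lambda p: p[0])

    # Assumed tokenization: fixed-decimal rounding to an integer token
    scale = 10 ** decimals
    tokenized_mz = [int(round(m * scale)) for m, _ in peaks]
    tokenized_precursor = int(round(precursor_mz * scale))
    norm_intensities = [i for _, i in peaks]
    return spectrum_id, tokenized_mz, tokenized_precursor, norm_intensities


# Made-up spectrum: two peaks fall below 1% of the base peak and are removed
result = clean_and_tokenize_sketch(
    "demo_1",
    mz=[50.0, 75.25, 100.5, 120.0, 150.75, 200.0],
    intensities=[5.0, 200.0, 1000.0, 50.0, 400.0, 2.0],
    precursor_mz=250.1,
    min_num_peaks=3,
)
print(result)
```

Returning `None` for rejected spectra makes it easy for a caller to keep only the spectra that pass every filter, which matches the idea that the final list contains only spectra that completed the pipeline.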

# Testing the dimensions / format of the data along the Transformer pipeline

In [2]:
# DataLoader

test_dataloader = data_loader_f(batch_size=2, num_spectra=10)

# Inspect the first batch only; expected shape: [batch_size, max_seq_len]
for batch in test_dataloader:
    mz_batch, int_batch, attention_mask_batch, batch_spectrum_ids, precursor_mask_batch = batch

    print(f' mz_batch: {mz_batch.shape}')
    print(f' int_batch: {int_batch.shape}')
    print(f' attention_mask_batch: {attention_mask_batch.shape}')
    print(f' precursor_mask_batch: {precursor_mask_batch.shape}')

    break
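
The `[batch_size, max_seq_len]` shape checked above comes from padding variable-length spectra to a common length. The sketch below shows that idea with plain Python lists so the shapes are easy to follow. It is not the collate logic of **data_loader_f**: the function name `pad_batch_sketch`, the padding id `0`, and the mask convention (1 for real peaks, 0 for padding) are assumptions.

```python
from typing import List, Sequence, Tuple

PAD_TOKEN = 0  # assumed padding id, not confirmed by the project code


def pad_batch_sketch(
    token_seqs: Sequence[Sequence[int]],
    intensity_seqs: Sequence[Sequence[float]],
    max_seq_len: int,
) -> Tuple[List[List[int]], List[List[float]], List[List[int]]]:
    """Pad or truncate every spectrum to max_seq_len and build an attention mask."""
    mz_batch, int_batch, mask_batch = [], [], []
    for tokens, ints in zip(token_seqs, intensity_seqs):
        tokens = list(tokens)[:max_seq_len]
        ints = list(ints)[:max_seq_len]
        n_pad = max_seq_len - len(tokens)
        mz_batch.append(tokens + [PAD_TOKEN] * n_pad)
        int_batch.append(ints + [0.0] * n_pad)
        # Assumed convention: 1 marks a real peak, 0 marks padding
        mask_batch.append([1] * len(tokens) + [0] * n_pad)
    return mz_batch, int_batch, mask_batch


# Two made-up tokenized spectra of different lengths
tokens = [[7525, 10050, 12000], [5003, 8110, 9920, 15075, 20010]]
ints = [[0.2, 1.0, 0.05], [0.1, 0.3, 1.0, 0.4, 0.02]]

mz_batch, int_batch, mask_batch = pad_batch_sketch(tokens, ints, max_seq_len=6)
print(len(mz_batch), len(mz_batch[0]))  # 2 6, i.e. [batch_size, max_seq_len]
print(mask_batch[0])                    # [1, 1, 1, 0, 0, 0]
```

In a PyTorch pipeline the same lists would typically be converted to tensors, and the mask convention would need to match what the encoder expects.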

In [None]:
model = EncoderTransformer(vocab_size=vocab_size, d_model=d_model, nhead=4, num_layers=4, dropout_rate=0.1)

for batch in test_dataloader:
    mz_batch, int_batch, attention_mask_batch, batch_spectrum_ids, precursor_mask_batch = batch
    
    try:
        output = model(mz_batch, int_batch, attention_mask_batch)
        print(f" Output shape: {output.shape}")  # Expected: [batch_size, 2048]
        print(f" Output range: {output.min().item():.4f} to {output.max().item():.4f}")
        break
    except Exception as e:
        print(f'Error: {e}')
        break