# Transformer Model Data Preparation Workflow

This notebook documents the initial pipeline for mass spectrometry data preparation and preprocessing, which is essential for feeding our future Transformer model.

We start by loading and processing MGF files, applying a series of filters and transformations to ensure the quality and appropriate format of the data.

The first step validates the MGF file path (**path_check**) and loads the raw spectra. The **mgf_get_spectra** function reads the MGF file and extracts each spectrum's data.

In [1]:
from src.utils import *
from src.data.flexible_dataloader import *
from src.model.transformer import EncoderTransformer

No normalization for SPS. Feature removed!
No normalization for AvgIpc. Feature removed!



Instructions for updating:
experimental_relax_shapes is deprecated, use reduce_retracing instead


Skipped loading modules with transformers dependency. No module named 'transformers'
cannot import name 'HuggingFaceModel' from 'deepchem.models.torch_models' (C:\Users\carla\miniconda3\envs\tese_d\Lib\site-packages\deepchem\models\torch_models\__init__.py)
Skipped loading modules with pytorch-lightning dependency, missing a dependency. No module named 'lightning'
Skipped loading some Jax models, missing a dependency. No module named 'jax'


In [2]:
mgf_data = r"/Users/carla/PycharmProjects/Mestrado/Transformer-Based-Models-for-Chemical-Fingerprint-Prediction/datasets/raw/cleaned_gnps_library.mgf"

path_check(mgf_data)

File found!


In [3]:
mgf_spect = mgf_get_spectra(mgf_data, num_spectra=10)
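
For readers unfamiliar with the MGF layout, the loading step above can be pictured with the minimal sketch below. It is not the project's **mgf_get_spectra** (whose implementation is not shown here and may rely on a dedicated MGF library); `read_mgf` and the dictionary layout it yields are hypothetical names chosen for illustration. The sketch only assumes the standard MGF structure: `BEGIN IONS` / `END IONS` blocks containing `KEY=value` headers and whitespace-separated `m/z intensity` peak lines.

```python
from typing import Dict, Iterable, Iterator, Optional


def read_mgf(lines: Iterable[str], num_spectra: Optional[int] = None) -> Iterator[Dict]:
    """Yield one dict per BEGIN IONS / END IONS block (hypothetical helper)."""
    spectrum = None
    yielded = 0
    for raw in lines:
        line = raw.strip()
        if line == "BEGIN IONS":
            # Start collecting a new spectrum
            spectrum = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if spectrum is not None:
                yield spectrum
                yielded += 1
                if num_spectra is not None and yielded >= num_spectra:
                    return
            spectrum = None
        elif spectrum is None or not line:
            # Ignore blank lines and anything outside a spectrum block
            continue
        elif "=" in line and not line[0].isdigit():
            # Header line such as TITLE=... or PEPMASS=...
            key, value = line.split("=", 1)
            spectrum["params"][key.strip().upper()] = value.strip()
        else:
            # Peak line: "<m/z> <intensity>"
            parts = line.split()
            if len(parts) >= 2:
                spectrum["peaks"].append((float(parts[0]), float(parts[1])))


# Usage on an in-memory example; a file handle works the same way
sample_lines = [
    "BEGIN IONS", "TITLE=demo", "PEPMASS=413.26",
    "100.0 10.0", "200.5 50.0", "END IONS",
    "BEGIN IONS", "TITLE=second", "50.0 1.0", "END IONS",
]
spectra = list(read_mgf(sample_lines, num_spectra=1))
```

Passing `num_spectra` stops reading early, which mirrors the intent of loading only the first 10 spectra for a quick check.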

The core transformation of the raw data happens next. The **mgf_deconvoluter** function iterates over each loaded spectrum and applies a series of cleaning and tokenization steps via **mgf_spectrum_deconvoluter**.

In [4]:
x = mgf_deconvoluter(
    mgf_data=mgf_spect,
    mz_vocabs=mz_vocabs,
    min_num_peaks=5,
    max_num_peaks=max_num_peaks,
    noise_rmv_threshold=0.01,
    mass_error=0.01,
    allowed_spectral_entropy=True,
    log=True,
)

[2] Rejected spectrum: 500
[5] Rejected after noise filtering: 3 peaks left


In [5]:
print(x)

[('CCMSLIB00000001547', 9805, [2883, 2945, 3163, 3187, 3235, 3243, 3388, 3429, 3460, 3601, 3608, 3639, 3648, 3672, 3680, 3741, 3748, 3818, 3832, 3896, 3930, 3961, 4034, 4128, 4267, 4352, 4423, 4453, 4467, 4542, 4551, 4565, 4633, 4689, 4701, 4743, 4751, 4760, 4779, 4790, 4822, 4862, 4872, 4902, 4947, 4974, 5020, 5033, 5042, 5092, 5112, 5123, 5140, 5149, 5200, 5222, 5280, 5300, 5336, 5370, 5382, 5397, 5471, 5531, 5550, 5563, 5589, 5603, 5631, 5639, 5654, 5703, 5710, 5742, 5761, 5799, 5811, 5825, 5842, 5972, 5984, 5994, 6013, 6083, 6124, 6212, 6220, 6230, 6242, 6373, 6393, 6402, 6451, 6505, 6561, 6571, 6584, 6624, 6673, 6792, 6810, 6850, 6906, 6922, 6933, 6953, 6961, 7085, 7098, 7107, 7131, 7146, 7223, 7231, 7245, 7274, 7348, 7434, 7465, 7523, 7606, 7635, 7643, 7683, 7693, 7704, 7864, 7951, 7962, 8056, 8106, 8113, 8121, 8203, 8224, 8275, 8294, 8303, 8311, 8351, 8375, 8385, 8446, 8464, 8471, 8504, 8514, 8523, 8646, 8653, 8674, 8683, 8761, 8824, 8872, 8883, 8911, 8925, 8946, 8980, 9004, 908

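The **mgf_deconvoluter** function returns a list of tuples, one per spectrum that passed the entire preprocessing pipeline. Each tuple holds the spectrum ID, the tokenized precursor m/z, the tokenized peak m/z values, and the intensities; in the printed output above, the ID comes first, followed by the precursor token and then the list of m/z tokens.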
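
To make the cleaning and tokenization steps more concrete, here is a minimal sketch of what a per-spectrum routine could look like. It is not the actual **mgf_spectrum_deconvoluter**, whose code is not shown. The function name `clean_and_tokenize` is hypothetical, and several behaviours are assumptions: intensities are normalized to the base peak, `noise_rmv_threshold` is treated as a relative intensity cutoff, spectra with more than `max_num_peaks` peaks are rejected rather than truncated, and m/z values are tokenized as bin indices of width `mass_error` (the real pipeline maps them through `mz_vocabs`). The spectral entropy filter is left out.

```python
from typing import List, Optional, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)


def clean_and_tokenize(
    peaks: List[Peak],
    min_num_peaks: int = 5,
    max_num_peaks: int = 100,
    noise_rmv_threshold: float = 0.01,
    mass_error: float = 0.01,
) -> Optional[Tuple[List[int], List[float]]]:
    """Return (mz_tokens, relative_intensities), or None if the spectrum is rejected."""
    # Assumption: spectra with too many peaks are rejected outright
    if not peaks or len(peaks) > max_num_peaks:
        return None

    base_peak = max(intensity for _, intensity in peaks)
    if base_peak <= 0:
        return None

    # Normalize to the base peak and drop peaks below the relative noise threshold
    kept = [
        (mz, intensity / base_peak)
        for mz, intensity in peaks
        if intensity / base_peak >= noise_rmv_threshold
    ]

    # Reject spectra that have too few peaks left after noise removal
    if len(kept) < min_num_peaks:
        return None

    kept.sort(key=lambda peak: peak[0])  # order by m/z

    # Assumption: a token is the index of the m/z bin of width `mass_error`
    mz_tokens = [int(round(mz / mass_error)) for mz, _ in kept]
    intensities = [intensity for _, intensity in kept]
    return mz_tokens, intensities


example_peaks = [
    (100.0, 1000.0), (150.25, 500.0), (200.5, 5.0),
    (250.0, 250.0), (300.0, 100.0), (350.0, 50.0),
]
accepted = clean_and_tokenize(example_peaks)
rejected = clean_and_tokenize(example_peaks[:5])  # only 4 peaks survive the noise filter
```

In this example the peak at m/z 200.5 has a relative intensity of 0.005 and is removed as noise. The full spectrum keeps five peaks and is accepted, while the shortened one falls below `min_num_peaks` and is rejected, comparable to the "Rejected after noise filtering" message in the log.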

# Testing the data dimensions / format along the Transformer pipeline

In [2]:
# DataLoader

test_dataloader = data_loader_f(batch_size=2, num_spectra=10)

# Inspect the first batch only; each tensor is expected to be [batch_size, max_seq_len]
for batch in test_dataloader:
    mz_batch, int_batch, attention_mask_batch, batch_spectrum_ids, precursor_mask_batch = batch

    print(f' mz_batch: {mz_batch.shape}')
    print(f' int_batch: {int_batch.shape}')
    print(f' attention_mask_batch: {attention_mask_batch.shape}')
    print(f' precursor_mask_batch: {precursor_mask_batch.shape}')

    break
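
The `[batch_size, max_seq_len]` shape implies that variable-length token sequences are padded to a common length and accompanied by an attention mask. The sketch below illustrates that idea with plain Python lists so it runs without extra dependencies. It is not the code behind **data_loader_f**, which presumably returns tensors. `pad_batch`, the padding id `0`, and the mask convention (1 for real tokens, 0 for padding) are assumptions made for illustration.

```python
from typing import List, Optional, Sequence, Tuple


def pad_batch(
    sequences: Sequence[Sequence[int]],
    max_seq_len: Optional[int] = None,
    pad_id: int = 0,
) -> Tuple[List[List[int]], List[List[int]]]:
    """Pad (or truncate) token sequences to one length and build an attention mask."""
    if max_seq_len is None:
        # Fall back to the longest sequence in the batch
        max_seq_len = max((len(seq) for seq in sequences), default=0)

    padded, mask = [], []
    for seq in sequences:
        tokens = list(seq)[:max_seq_len]  # truncate sequences that are too long
        n_pad = max_seq_len - len(tokens)
        padded.append(tokens + [pad_id] * n_pad)
        mask.append([1] * len(tokens) + [0] * n_pad)  # 1 = real token, 0 = padding
    return padded, mask


mz_tokens, attention_mask = pad_batch([[5, 6, 7], [8]], max_seq_len=4)
```

Every row of `mz_tokens` and `attention_mask` now has length 4, so stacking them gives a `[2, 4]` batch, the same kind of shape the check above prints for the real DataLoader.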

In [None]:
model = EncoderTransformer(vocab_size=vocab_size, d_model=d_model, nhead=4, num_layers=4, dropout_rate=0.1)

for batch in test_dataloader:
    mz_batch, int_batch, attention_mask_batch, batch_spectrum_ids, precursor_mask_batch = batch
    
    try:
        output = model(mz_batch, int_batch, attention_mask_batch)
        print(f" Output shape: {output.shape}")  # Expected: [batch_size, 2048]
        print(f" Output range: {output.min().item():.4f} to {output.max().item():.4f}")
        break
    except Exception as e:
        print(f'Error: {e}')
        break