# API pipeline

This notebook shows how to use the API (the `Transformer` class) directly, without relying on the pre-made scripts, which gives you more flexibility.

First, you need to import some modules from the repository:


In [None]:
from src.utils import path_check
from src.utils import calculate_mz_vocabs, calculate_max_num_peaks
from src.data.split_prep_tools.data_splitting import preprocess_and_split
from src.data.mgf_tools.mgf_get import mgf_get_spectra
from src.data.data_loader import data_loader
from src.models.Transformer import Transformer

Next, set the path to the **.mgf** file and check it:

In [None]:
mgf_path = "path_to_the_file"

path_check(mgf_path)

Then, you must **read the file** and **calculate** some of the **parameters** that will be needed later:

In [None]:
mgf_spectra = mgf_get_spectra(mgf_path)

In [None]:
max_num_peaks = calculate_max_num_peaks(mgf_spectra, percentile=95)
mz_vocabs = calculate_mz_vocabs(mgf_spectra)
max_seq_len = max_num_peaks + 1
vocab_size = len(mz_vocabs)
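
To illustrate what a percentile-based peak cap does, here is a minimal sketch that is independent of the repository. The helper `percentile_peak_count` and the sample peak counts are hypothetical, and the nearest-rank method used here is an assumption; `calculate_max_num_peaks` may compute the percentile differently.

```python
import math


def percentile_peak_count(peak_counts, percentile=95):
    """Return the nearest-rank percentile of the number of peaks per spectrum.

    Hypothetical helper for illustration only; not the repository's code.
    """
    if not peak_counts:
        raise ValueError("peak_counts must not be empty")
    ordered = sorted(peak_counts)
    # Nearest-rank: smallest value with at least `percentile`% of the data at or below it.
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up peak counts for ten spectra; one spectrum is unusually long.
peak_counts = [12, 15, 9, 20, 18, 14, 11, 16, 13, 250]

# A cap below 100% ignores the outlier instead of sizing every input for it.
cap = percentile_peak_count(peak_counts, percentile=90)

# Mirrors max_seq_len = max_num_peaks + 1 from the cell above.
max_seq_len_example = cap + 1
```

In this toy data, the cap follows the typical spectra rather than the single 250-peak outlier, which keeps the sequence length small.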

Next, split the dataset into **training, validation, and test** sets.

The **seed** is an important parameter throughout this pipeline.

Any function that writes output stores its results in a folder corresponding to the seed it was called with, so a call with seed 1 saves its results under seed 1.

In [None]:
splits = preprocess_and_split(mgf_path, seed=1)
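
The role of the seed can be sketched with a generic seeded split. This is only an illustration: the function `seeded_split` and the 80/10/10 fractions are assumptions, not the behavior of `preprocess_and_split`.

```python
import random


def seeded_split(n_items, seed, fractions=(0.8, 0.1, 0.1)):
    """Shuffle item indices with a fixed seed and cut them into three parts.

    Hypothetical helper; the fractions are an assumption for illustration.
    """
    indices = list(range(n_items))
    # A dedicated Random instance keeps the shuffle independent of global state.
    random.Random(seed).shuffle(indices)
    n_train = int(fractions[0] * n_items)
    n_val = int(fractions[1] * n_items)
    return {
        "train": indices[:n_train],
        "val": indices[n_train:n_train + n_val],
        "test": indices[n_train + n_val:],
    }


first = seeded_split(100, seed=1)
again = seeded_split(100, seed=1)  # same seed, same partition
```

Because the same seed reproduces the same partition, later steps can refer to a split by its seed alone.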

After the splits, we call the **data_loader**:

In [None]:
loaders = data_loader(
    mgf_path=mgf_path,
    batch_size=16,
    seed=1,
    max_num_peaks=max_num_peaks,
    mz_vocabs=mz_vocabs,
)


Remember that `data_loader` fetches the splits saved under the seed it receives: **`data_loader(..., seed=1)` loads the data saved with seed 1.**

After that, we can instantiate the model.

The seed passed in this step determines the seed under which the logs of the `.fit`, `.eval`, and `.predict` methods are saved.

In [None]:
model = Transformer(
    seed=1,
    max_seq_len=max_seq_len,
    vocab_size=vocab_size,
    morgan_default_dim=2048,
    d_model=128,
    n_head=4,
    num_layers=4,
    dropout_rate=0.1,
)
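
When changing `d_model` or `n_head`, a quick sanity check can help. Standard multi-head attention splits `d_model` evenly across the heads, so `d_model` must be divisible by `n_head`. The sketch below assumes this repository's `Transformer` follows that convention; `check_attention_dims` is a hypothetical helper, not part of the API.

```python
def check_attention_dims(d_model, n_head):
    """Return the per-head dimension, or raise if d_model does not split evenly.

    Hypothetical helper; assumes standard multi-head attention.
    """
    if d_model % n_head != 0:
        raise ValueError(
            f"d_model ({d_model}) must be divisible by n_head ({n_head})"
        )
    return d_model // n_head


# The values used in the cell above: 128 / 4 gives 32 dimensions per head.
head_dim = check_attention_dims(d_model=128, n_head=4)
```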

Then we can fit the model to the training data:

In [None]:
best_model = model.fit(train_loader=loaders["train"], val_loader=loaders["val"], max_epochs=100)

And evaluate on the test set:

In [None]:
model.eval(test_loader=loaders["test"])

Once the model is trained, we can use it for its main purpose: predicting molecular fingerprints from new mass spectrometry data. To do this, we use the `.predict()` method.

In this notebook, we use the test loader (`loaders["test"]`) as an example, but you can create a DataLoader with any set of spectra for which you want predictions.

In [None]:
model.predict(loaders["test"], return_probabilities=False, save_results=False)
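
The `return_probabilities` flag suggests that predictions can be returned either as per-bit probabilities or as binary fingerprints. As a rough sketch of how probabilities are commonly turned into bits, the snippet below applies a fixed threshold. The helper `binarize_fingerprint` and the 0.5 threshold are assumptions for illustration; the source does not state how `.predict()` does this.

```python
def binarize_fingerprint(probabilities, threshold=0.5):
    """Map per-bit probabilities to 0/1 fingerprint bits.

    Hypothetical helper; the 0.5 threshold is an assumption.
    """
    return [1 if p >= threshold else 0 for p in probabilities]


# Made-up probabilities for a four-bit fingerprint.
probs = [0.91, 0.07, 0.50, 0.33]
bits = binarize_fingerprint(probs)
```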
