# How to use the repository's MGF functions correctly

First, import the repository modules that provide the MGF utilities.

In [1]:
from src.utils import *
from scripts.process_mgf import *
from scripts.plot_mgf import *

Then set the path to the **.mgf** file and check that it points to an existing file.

In [2]:
mgf_data = r"/Users/carla/PycharmProjects/Mestrado/Transformer-Based-Models-for-Chemical-Fingerprint-Prediction/data/raw/cleaned_gnps_library.mgf"

path_check(mgf_data)

File found!
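
For readers without the repository at hand, a check of this kind can be sketched with `pathlib`. The helper name `path_check_sketch` and its messages are assumptions made for illustration; this is not the repository's `path_check` implementation.

```python
from pathlib import Path


def path_check_sketch(path: str) -> bool:
    """Hypothetical stand-in for path_check: report whether `path` is an existing file."""
    found = Path(path).is_file()
    print("File found!" if found else f"File not found: {path}")
    return found


# Usage: a path that is not expected to exist on any machine
path_check_sketch("/nonexistent/example.mgf")
```

Returning a boolean as well as printing lets later cells stop early when the file is missing.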


Then, to confirm that the spectra are being iterated over correctly, you can use the reading functions:

- validate_mgf_structure
- mgf_read_headers
- mgf_read_all

In [3]:
validate_mgf_structure(mgf_data)


Total spectra found: 459426
Missing SCANS: 0
Missing SPECTRUM_ID: 0
Duplicate SCANS: 49616 -> ['1', '2', '3', '4', '5']...
Duplicate SPECTRUM_ID: 0 -> []...


{'total_spectra': 459426,
 'missing_scans': 0,
 'missing_specids': 0,
 'duplicate_scans': ['1', '2', '3', '4', '5', ...
[output truncated]
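
The report above shows many repeated `SCANS` values and no repeated `SPECTRUM_ID`, which suggests that `SPECTRUM_ID` is the safer key for identifying a spectrum. A structural check of this kind can be sketched as follows. It assumes the standard MGF layout in which each spectrum sits between `BEGIN IONS` and `END IONS`; the function name `validate_mgf_text` and the toy input are hypothetical, and this is not the repository's `validate_mgf_structure`.

```python
from collections import Counter


def validate_mgf_text(lines):
    """Hypothetical sketch: count spectra and find missing or duplicate SCANS values."""
    total = 0
    missing_scans = 0
    scans_seen = Counter()
    current_scans = None
    in_block = False
    for raw in lines:
        line = raw.strip()
        if line == "BEGIN IONS":
            in_block = True
            current_scans = None
        elif line == "END IONS" and in_block:
            total += 1
            if current_scans is None:
                missing_scans += 1
            else:
                scans_seen[current_scans] += 1
            in_block = False
        elif in_block and line.startswith("SCANS="):
            current_scans = line.split("=", 1)[1]
    # A SCANS value is a duplicate when more than one spectrum carries it
    duplicates = [scan for scan, count in scans_seen.items() if count > 1]
    return {
        "total_spectra": total,
        "missing_scans": missing_scans,
        "duplicate_scans": duplicates,
    }


# Toy input (not taken from the real file): two spectra share SCANS=1, one has none
example = """BEGIN IONS
SCANS=1
100.0 10.0
END IONS
BEGIN IONS
SCANS=1
150.0 5.0
END IONS
BEGIN IONS
200.0 1.0
END IONS
""".splitlines()

report = validate_mgf_text(example)
print(report)
```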

In [3]:
mgf_read_headers(mgf_data=mgf_data, num_spectra=3)

In [5]:
mgf_read_all(mgf_data=mgf_data, num_spectra=1)
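
To illustrate what reading only the first few spectra involves, here is a minimal sketch of a block-by-block MGF reader. It assumes the standard `BEGIN IONS` / `END IONS` layout, `KEY=value` header lines, and two-column peak lines; the name `iter_mgf_spectra` and the toy input are hypothetical, and the repository's `mgf_read_headers` and `mgf_read_all` may work differently.

```python
from itertools import islice


def iter_mgf_spectra(lines):
    """Hypothetical sketch: yield (headers, peaks) for each BEGIN IONS ... END IONS block."""
    headers, peaks, in_block = {}, [], False
    for raw in lines:
        line = raw.strip()
        if line == "BEGIN IONS":
            headers, peaks, in_block = {}, [], True
        elif line == "END IONS":
            if in_block:
                yield headers, peaks
            in_block = False
        elif in_block and line:
            if "=" in line:
                # Header line; split only on the first "=" so values may contain "="
                key, value = line.split("=", 1)
                headers[key] = value
            else:
                # Peak line; assumes at least two whitespace-separated columns
                mz, intensity = line.split()[:2]
                peaks.append((float(mz), float(intensity)))


# Toy input (not taken from the real file)
example = """BEGIN IONS
PEPMASS=181.07
SPECTRUM_ID=EXAMPLE0001
100.0 10.0
120.5 55.0
END IONS
BEGIN IONS
PEPMASS=300.1
SPECTRUM_ID=EXAMPLE0002
80.0 1.0
END IONS
""".splitlines()

# islice stops the generator after the requested number of spectra
first = list(islice(iter_mgf_spectra(example), 1))
print(first)
```

Because the reader is a generator, limiting the number of spectra avoids parsing the rest of a large file.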

Also check whether any spectrum lacks a **valid ID**.

In [6]:
check_spectrum_ids(mgf_data)

All spectra have valid IDs


After the above checks, you can collect the information for each spectrum with the function **mgf_get_spectra**.

In [4]:
mgf_spect = mgf_get_spectra(mgf_data)

Once the data has been collected, it is also important to check the type of the output.

If more than one spectrum was collected, the output should be a list. If only one spectrum was collected, it should be a dictionary.

In [8]:
type(mgf_spect)

list
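
Since the plotting functions further down distinguish a dictionary (one spectrum) from a list (several spectra), it can help to normalize the output once. The helper `as_spectrum_list` below is a hypothetical convenience, not part of the repository.

```python
def as_spectrum_list(spectra):
    """Hypothetical helper: return a list of spectra for either a dict or a list input."""
    if isinstance(spectra, dict):
        # A single spectrum is wrapped so callers can always iterate
        return [spectra]
    if isinstance(spectra, list):
        return spectra
    raise TypeError(f"Expected dict or list, got {type(spectra).__name__}")


# Usage with toy data
single = {"SPECTRUM_ID": "EXAMPLE0001"}
print(len(as_spectrum_list(single)))
print(len(as_spectrum_list([single, single])))
```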

After loading the data, we can do a short exploratory analysis of the data and metadata in the file.

First, check the number of compounds and the ionization modes using **check_mgf_data**. It is also important to check that the total number of compounds matches the number of spectra found by the **validate_mgf_structure** function.

In [5]:
check_mgf_data(mgf_spect)

{'Total compounds': 459426,
 'Unique compounds': 49059,
 'Unknown compounds': 1486,
 'Positive ionization mode': 379652,
 'Negative ionization mode': 79774,
 'Unknown ionization mode': 0}
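
The comparison with **validate_mgf_structure** can be done programmatically. The sketch below uses the values printed above; the function name `summary_is_consistent` is hypothetical. It also checks that the three ionization-mode counts add up to the total.

```python
def summary_is_consistent(summary, expected_total):
    """Hypothetical check: totals agree and ionization modes account for every spectrum."""
    mode_total = (
        summary["Positive ionization mode"]
        + summary["Negative ionization mode"]
        + summary["Unknown ionization mode"]
    )
    return summary["Total compounds"] == expected_total and mode_total == expected_total


# Values copied from the check_mgf_data output above
summary = {
    "Total compounds": 459426,
    "Unique compounds": 49059,
    "Unknown compounds": 1486,
    "Positive ionization mode": 379652,
    "Negative ionization mode": 79774,
    "Unknown ionization mode": 0,
}

# 459426 is the total_spectra value reported by validate_mgf_structure
print(summary_is_consistent(summary, expected_total=459426))
```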

You can also see the distribution of spectra by compound using **plot_spectra_distribution**.

In [4]:
plot_spectra_distribution(spectra=mgf_spect, top_percent=95)

And some information about m/z and the number of peaks in the spectra using **check_mgf_spectra**.

In [4]:
check_mgf_spectra(mgf_spect)
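
As a rough idea of the statistics involved, the sketch below computes peak counts and the m/z range with the standard library. It assumes each spectrum is a dictionary with an `"mz"` list, which is an assumption about the data layout; the function name `peak_statistics` is hypothetical, and the repository's `check_mgf_spectra` may report different quantities.

```python
from statistics import mean, median


def peak_statistics(spectra):
    """Hypothetical sketch: summarize peak counts and the m/z range over all spectra."""
    peak_counts = [len(s["mz"]) for s in spectra]
    all_mz = [mz for s in spectra for mz in s["mz"]]
    return {
        "min_peaks": min(peak_counts),
        "max_peaks": max(peak_counts),
        "mean_peaks": mean(peak_counts),
        "median_peaks": median(peak_counts),
        "min_mz": min(all_mz),
        "max_mz": max(all_mz),
    }


# Toy spectra (not taken from the real file); the "mz" key is an assumed layout
toy = [{"mz": [50.0, 100.0, 150.0]}, {"mz": [75.5]}]
stats = peak_statistics(toy)
print(stats)
```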

After the exploratory analysis, you can plot the spectra.

If the data collected is a dictionary (1 spectrum), you should use the function **plot_spectrum** to visualize it.

In [9]:
plot_spectrum(mgf_spect)

If the data collected is a list (more than 1 spectrum), you should use the function **plot_spectra** to visualize it.

In [8]:
plot_spectra(mgf_spect)

In addition, it is possible to isolate and obtain the SMILES of the molecules within the .mgf file.

In [3]:
data = mgf_get_smiles(mgf_data, num_spectra=25)
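
A minimal sketch of SMILES extraction from MGF text follows. It assumes each spectrum block may carry a `SMILES=` header line; the function name `extract_smiles` and the toy input are hypothetical, and the return format of the repository's `mgf_get_smiles` is not shown here. Splitting only on the first `=` matters because SMILES strings use `=` for double bonds.

```python
def extract_smiles(lines, num_spectra=None):
    """Hypothetical sketch: collect one SMILES (or None) per spectrum block."""
    smiles = []
    current = None
    for raw in lines:
        line = raw.strip()
        if line == "BEGIN IONS":
            current = None
        elif line.upper().startswith("SMILES="):
            # Split on the first "=" only; SMILES may contain "=" for double bonds
            current = line.split("=", 1)[1].strip() or None
        elif line == "END IONS":
            smiles.append(current)
            if num_spectra is not None and len(smiles) >= num_spectra:
                break
    return smiles


# Toy input (not taken from the real file): the second spectrum has no SMILES
example = """BEGIN IONS
SMILES=CC(=O)O
60.0 100.0
END IONS
BEGIN IONS
45.0 20.0
END IONS
BEGIN IONS
SMILES=C=C
28.0 5.0
END IONS
""".splitlines()

print(extract_smiles(example, num_spectra=2))
print(extract_smiles(example))
```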

In [None]:
"""
Deconvoluter test

for i, spec in enumerate(spectra):
    resultado = mgf_spectrum_deconvoluter(
        (i, spec),
        MIN_NUM_PEAKS=5,
        MAX_NUM_PEAKS=1000,
        NOISE_REMOVAL_THRESHOLD=0.01,
        allowedSpectralEntropy=5.0,
        mass_error=0.01,
        mz_vocabs=mz_vocabs,
        export_parameters={},
        logMINT=True
    )
"""