# Testing the IDSL_Mint data on the model

- This notebook prepares and processes the data used in IDSL_Mint.

- The goal is to adapt the data so that the model we built can consume it. The notebook starts with the data already in **.mgf** format. To convert from **.msp** to **.mgf** we used the script **msp_2_mgf.py**.
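The script **msp_2_mgf.py** is not reproduced in this notebook. As a rough illustration of what such a conversion involves, the sketch below parses MSP records (header lines of the form `Key: value` followed by peak lines) and writes them as `BEGIN IONS` / `END IONS` blocks. The function name `msp_to_mgf`, the mapping of `PrecursorMZ` to `PEPMASS`, and the example record are assumptions made for illustration, not the actual script.

```python
def msp_to_mgf(msp_text):
    """Convert MSP-formatted text into MGF-formatted text (illustrative only)."""
    blocks = []
    params, peaks = {}, []
    # A trailing empty line guarantees that the last record is flushed.
    for line in msp_text.splitlines() + [""]:
        line = line.strip()
        if not line:
            # A blank line ends the current record.
            if params or peaks:
                blocks.append((params, peaks))
                params, peaks = {}, []
        elif ":" in line and not line[0].isdigit():
            # Header line, e.g. "PrecursorMZ: 760.585"
            key, value = line.split(":", 1)
            params[key.strip().upper().replace(" ", "_")] = value.strip()
        else:
            # Peak line: m/z followed by intensity
            mz, intensity = line.split()[:2]
            peaks.append((float(mz), float(intensity)))

    out = []
    for params, peaks in blocks:
        out.append("BEGIN IONS")
        for key, value in params.items():
            if key == "NUM_PEAKS":
                continue  # MGF does not store the peak count explicitly
            # Assumed mapping: MSP PrecursorMZ -> MGF PEPMASS
            key = "PEPMASS" if key == "PRECURSORMZ" else key
            out.append(f"{key}={value}")
        out.extend(f"{mz} {intensity}" for mz, intensity in peaks)
        out.append("END IONS")
        out.append("")
    return "\n".join(out)


# Hypothetical MSP record used only to exercise the function.
example_msp = """Name: Example lipid
PrecursorMZ: 760.585
Num Peaks: 2
184.073 999
86.096 120
"""
print(msp_to_mgf(example_msp))
```

Real MSP files vary in their header names and peak separators, so a production converter needs more careful parsing than this.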

## Reading the files and the data

In [1]:
from src.utils import *
from src.data.mgf_tools.mgf_get import *
from src.data.mgf_tools.mgf_checks import *

We start by reading the IDSL_Mint files. In the article and on Zenodo they are provided separately for each ionization mode.

In [2]:
pos_ion_mode_path = 'datasets/idsl_mint_data/zenodo_2_lipidmaps_msp/LIPIDMSPs_POS.mgf'
neg_ion_mode_path = 'datasets/idsl_mint_data/zenodo_2_lipidmaps_msp/LIPIDMSPs_NEG.mgf'



pos_ion_mode = mgf_get_spectra(pos_ion_mode_path)
neg_ion_mode = mgf_get_spectra(neg_ion_mode_path)

We check that the files were read correctly and that the data are consistent.

In [3]:
print(check_mgf_data(pos_ion_mode))
print(check_mgf_data(neg_ion_mode))

{'Total compounds': 106388, 'Unique compounds': 16104, 'Unknown compounds': 0, 'Positive ionization mode': 106388, 'Negative ionization mode': 0, 'Unknown ionization mode': 0}
{'Total compounds': 29008, 'Unique compounds': 11958, 'Unknown compounds': 0, 'Positive ionization mode': 0, 'Negative ionization mode': 29007, 'Unknown ionization mode': 1}
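The summary for the negative-mode file reports one spectrum with an unknown ionization mode. A small helper along the following lines could locate such entries. It assumes each spectrum is a dict with a `params` mapping (as used later in this notebook) and that the mode is stored under a key named `ionmode`; that key name and the toy records are hypothetical.

```python
def find_unknown_ion_mode(spectra, key="ionmode"):
    """Return the IDs of spectra whose ionization mode is neither positive nor negative."""
    known = {"positive", "negative"}
    unknown = []
    for s in spectra:
        # Normalise case and whitespace before comparing.
        value = str(s["params"].get(key, "")).strip().lower()
        if value not in known:
            unknown.append(s["params"].get("spectrum_id"))
    return unknown


# Toy records; the "ionmode" key is an assumption about the MGF fields.
spectra = [
    {"params": {"spectrum_id": "a", "ionmode": "Negative"}},
    {"params": {"spectrum_id": "b", "ionmode": "negative"}},
    {"params": {"spectrum_id": "c", "ionmode": "N/A"}},
]
unknown_ids = find_unknown_ion_mode(spectra)
print(unknown_ids)
```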


In [4]:
print(check_mgf_spectra(pos_ion_mode, percentile=95))
print(check_mgf_spectra(neg_ion_mode, percentile=95))

{'m/z range': (4.0, 2991.763184), 'peak count stats': {'min': 1, 'max': 8257, 'mean': 28.254838921947414, 'median': 9.0, 'percentile': {'25%': 4.0, '75%': 24.0, '90%': 68.0, '95%': 115.0, '99%': 234.0, '95': 115.0}}}
{'m/z range': (25.0032, 1999.419312), 'peak count stats': {'min': 1, 'max': 9203, 'mean': 25.347292169713693, 'median': 10.0, 'percentile': {'25%': 5.0, '75%': 20.0, '90%': 42.0, '95%': 62.54999999999927, '99%': 129.0, '95': 62.54999999999927}}}
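The non-integer 95th percentile (62.54999999999927) suggests that the percentile is interpolated between neighbouring ranked peak counts. How `check_mgf_spectra` computes it is not shown here, so the following is only a sketch of linear interpolation between closest ranks, with made-up peak counts.

```python
def percentile(values, q):
    """Percentile with linear interpolation between the closest ranks."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    # Fractional rank of the requested percentile in the sorted list.
    rank = (len(ordered) - 1) * q / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


# Hypothetical peak counts for ten spectra.
peak_counts = [1, 4, 5, 9, 10, 20, 24, 42, 68, 115]
p95 = percentile(peak_counts, 95)
print(p95)
```

Because the result can fall between two observed counts, a value used as a sequence length (such as `max_num_peaks`) may need to be rounded.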


We merge the two files, making sure that no IDs are repeated.


In [5]:
for i, s in enumerate(pos_ion_mode):
    s['params']['spectrum_id'] = f"POS_{i}_{s['params']['spectrum_id']}"

for i, s in enumerate(neg_ion_mode):
    s['params']['spectrum_id'] = f"NEG_{i}_{s['params']['spectrum_id']}"


all_data = pos_ion_mode + neg_ion_mode
print(len(all_data))
print(check_mgf_data(all_data))

135396
{'Total compounds': 135396, 'Unique compounds': 25604, 'Unknown compounds': 0, 'Positive ionization mode': 106388, 'Negative ionization mode': 29007, 'Unknown ionization mode': 1}
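To confirm that the prefixing actually yields unique IDs, a quick check such as the one below can be run on the merged list. It relies only on the `s['params']['spectrum_id']` layout used above; the helper name `find_duplicate_ids` and the toy spectra are illustrative.

```python
from collections import Counter


def find_duplicate_ids(spectra):
    """Return the spectrum IDs that occur more than once."""
    counts = Counter(s["params"]["spectrum_id"] for s in spectra)
    return sorted(sid for sid, n in counts.items() if n > 1)


# Toy data: the ID "A" appears in both ionization modes.
pos = [{"params": {"spectrum_id": "A"}}, {"params": {"spectrum_id": "B"}}]
neg = [{"params": {"spectrum_id": "A"}}]

before = find_duplicate_ids(pos + neg)

# Same renaming scheme as in the cell above: mode prefix plus position.
for prefix, group in (("POS", pos), ("NEG", neg)):
    for i, s in enumerate(group):
        s["params"]["spectrum_id"] = f"{prefix}_{i}_{s['params']['spectrum_id']}"

after = find_duplicate_ids(pos + neg)
print(before, after)
```

Including the position `i` in the new ID also separates repeated IDs within a single file, not only across the two files.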


Finally, we save the merged file.

In [19]:
from pyteomics import mgf

output_path = 'datasets/idsl_mint_data/zenodo_2_lipidmaps_msp/idsl_mint.mgf'
mgf.write(all_data, output_path)

<pyteomics.auxiliary.file_helpers._file_obj at 0x7fe23fd438d0>

We use the **mgf_get_smiles** function to obtain a dataframe with the SMILES of the compounds associated with the spectra.


In [6]:
smiles_df = mgf_get_smiles(all_data, as_dataframe=True)
len(smiles_df)

[11:36:02] ERROR: 

[11:36:03] ERROR: 

[11:36:03] ERROR: 

[11:36:03] ERROR: 

[11:36:03] ERROR: 



135391

We compute some important quantities.

In [7]:
smiles_df['canon_smiles'] = canonicalize_smiles(smiles_df['smiles'].tolist())
smiles_df = smiles_df.dropna(subset=['canon_smiles'])


max_num_peaks = calculate_max_num_peaks(all_data, percentile=95)
mz_vocabs = calculate_mz_vocabs(all_data)

max_seq_len = max_num_peaks + 1
vocab_size = len(mz_vocabs)

[11:36:21] SMILES Parse Error: syntax error while parsing: n/a
[11:36:21] SMILES Parse Error: Failed parsing SMILES 'n/a' for input: 'n/a'
[11:36:21] SMILES Parse Err

We then save them to a file so that the pipeline can use them.

In [21]:
import json

pipeline_config = {
    'max_num_peaks': max_num_peaks,
    'max_seq_len': max_seq_len,
    'mz_vocabs': mz_vocabs,
    'vocab_size': vocab_size,
}

config_path = 'pipeline_config.json'
with open(config_path, 'w') as f:
    json.dump(pipeline_config, f, indent=4)
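On the consuming side, the pipeline can read this file back and verify the two relationships defined earlier (`max_seq_len = max_num_peaks + 1` and `vocab_size = len(mz_vocabs)`). The loader below is a minimal sketch: the function name `load_pipeline_config` and the toy values are hypothetical, and `mz_vocabs` is assumed to be a JSON-serialisable list.

```python
import json
import os
import tempfile


def load_pipeline_config(path):
    """Load the pipeline configuration and check its internal consistency."""
    with open(path) as f:
        config = json.load(f)
    if config["max_seq_len"] != config["max_num_peaks"] + 1:
        raise ValueError("max_seq_len should equal max_num_peaks + 1")
    if config["vocab_size"] != len(config["mz_vocabs"]):
        raise ValueError("vocab_size should equal len(mz_vocabs)")
    return config


# Toy configuration written to a temporary directory for the round trip.
toy_config = {
    "max_num_peaks": 3,
    "max_seq_len": 4,
    "mz_vocabs": ["50.01", "60.02"],
    "vocab_size": 2,
}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "pipeline_config.json")
    with open(path, "w") as f:
        json.dump(toy_config, f, indent=4)
    loaded = load_pipeline_config(path)

print(loaded["max_seq_len"])
```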

In [17]:
smiles_df

Unnamed: 0,spectrum_id,smiles,canon_smiles
0,POS_0_CCMSLIB00000001635;,C[C@H]1/C=C/C=C(\C(=O)NC2=CC(=O)C3=C(C(=C(C(=C...,C/C1=C/C=C/[C@H](C)[C@H](O)[C@@H](C)[C@@H](O)[...
1,POS_1_CCMSLIB00000001637;,C[C@H]1/C=C/C=C(\C(=O)NC2=CC(=O)C3=C(C(=C(C(=C...,C/C1=C/C=C/[C@H](C)[C@H](O)[C@@H](C)[C@@H](O)[...
2,POS_2_CCMSLIB00000001727;,C[C@H](CCCC(C)C)[C@H]1CC[C@H]([C@]1(C)CCO)[C@@...,CC(C)CCC[C@@H](C)[C@H]1CC[C@@H]([C@@H]2C[C@@H]...
3,POS_3_CCMSLIB00000006870;,[H][C@@]12CCCN1C(=O)[C@]([H])(NC(=O)[C@@]([H])...,Cc1c2oc3c(C)ccc(C(=O)N[C@@H]4C(=O)N[C@H](C(C)C...
4,POS_4_CCMSLIB00000006871;,[H][C@@]12CCCN1C(=O)[C@]([H])(NC(=O)[C@@]([H])...,Cc1c2oc3c(C)ccc(C(=O)N[C@@H]4C(=O)N[C@H](C(C)C...
...,...,...,...
135386,NEG_29003_Bruker_HCD_library000986,CC1C(=O)OC2C(O)C34C5OC(=O)C3(OC3OC(=O)C(O)C34C...,CC1C(=O)OC2C(O)C34C5OC(=O)C3(OC3OC(=O)C(O)C34C...
135387,NEG_29004_Bruker_HCD_library000992,COc1cc(O[C@@H]2O[C@H](CO)[C@@H](O)[C@H](O)[C@H...,COc1cc(O[C@@H]2O[C@H](CO)[C@@H](O)[C@H](O)[C@H...
135388,NEG_29005_Bruker_HCD_library000993,COc1cc(O[C@@H]2O[C@H](CO)[C@@H](O)[C@H](O)[C@H...,COc1cc(O[C@@H]2O[C@H](CO)[C@@H](O)[C@H](O)[C@H...
135389,NEG_29006_Bruker_HCD_library000999,COc1ccc(C=CC(=O)c2ccc(O)cc2O)cc1,COc1ccc(C=CC(=O)c2ccc(O)cc2O)cc1


## Cleaning and tokenizing the spectra

We now clean and tokenize the spectra with **mgf_deconvoluter**.

In [18]:
processed_spectra = mgf_deconvoluter(
        mgf_data=all_data,
        mz_vocabs=mz_vocabs,
        min_num_peaks=5,
        max_num_peaks=max_num_peaks,
        noise_rmv_threshold=0.01,
        mass_error=0.01,
        log=False)

In [19]:
len(processed_spectra)

85213

In [25]:
processed_spectra[1]

('POS_2_CCMSLIB00000001727;',
 4468,
 [4289,
  3189,
  3749,
  4109,
  4391,
  4319,
  3851,
  4035,
  3931,
  4161,
  3632,
  3965,
  4431,
  4475,
  2595,
  3418,
  3667,
  2450,
  1568,
  3742,
  4411,
  4102,
  4189,
  2268,
  1850,
  3832,
  3155,
  4168,
  2702,
  2009,
  2429,
  2201,
  2811,
  2971,
  3170,
  3992,
  1304,
  1708,
  1890],
 2.9932566694874083,
 array([0.27808872, 0.07424219, 0.0608785 , 0.06060001, 0.03719211,
        0.03607659, 0.03245047, 0.03021846, 0.0274278 , 0.02714869,
        0.02491554, 0.02324031, 0.02268183, 0.02156475, 0.01988883,
        0.01709474, 0.01597676, 0.01541769, 0.01485856, 0.01457898,
        0.01429938, 0.01318083, 0.01206201, 0.01010338, 0.00954358,
        0.0089837 , 0.00814369, 0.00786363, 0.00758355, 0.00702329,
        0.00646292, 0.00618268, 0.0059024 , 0.00562209, 0.00534174,
        0.00506135, 0.00478092, 0.00393934, 0.00337801]))

Only 85213 of the 135396 spectra remain after cleaning.
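The internals of **mgf_deconvoluter** are not shown in this notebook. Judging only from its parameters and from the example output above (token indices plus intensities that sum to roughly 1), a cleaning and tokenization step of this kind could look like the sketch below. Everything in it is an assumption for illustration: the function name, the vocabulary being a sorted list of m/z values, the noise threshold being relative to the base peak, and the normalisation by total intensity.

```python
from bisect import bisect_left


def clean_and_tokenize(peaks, vocab, min_num_peaks=5, max_num_peaks=100,
                       noise_threshold=0.01, mass_error=0.01):
    """peaks: list of (mz, intensity) pairs; vocab: sorted list of m/z values."""
    if not peaks:
        return None
    base = max(intensity for _, intensity in peaks)
    # 1. Drop peaks below a fraction of the most intense (base) peak.
    kept = [(mz, i) for mz, i in peaks if i >= noise_threshold * base]
    # 2. Keep only the most intense peaks, in descending order of intensity.
    kept = sorted(kept, key=lambda p: p[1], reverse=True)[:max_num_peaks]
    # 3. Map each m/z to the nearest vocabulary entry within the mass error.
    tokens, intensities = [], []
    for mz, intensity in kept:
        pos = bisect_left(vocab, mz)
        candidates = [j for j in (pos - 1, pos) if 0 <= j < len(vocab)]
        if not candidates:
            continue
        best = min(candidates, key=lambda j: abs(vocab[j] - mz))
        if abs(vocab[best] - mz) <= mass_error:
            tokens.append(best)
            intensities.append(intensity)
    # 4. Discard spectra that end up with too few usable peaks.
    if len(tokens) < min_num_peaks:
        return None
    total = sum(intensities)
    return tokens, [i / total for i in intensities]


# Toy vocabulary and spectrum.
vocab = [50.0, 60.0, 70.0, 80.0, 90.0, 100.0]
peaks = [(50.001, 100), (60.002, 50), (70.0, 25), (80.004, 10),
         (90.0, 5), (100.0, 0.5), (65.0, 40)]
tokens, weights = clean_and_tokenize(peaks, vocab)
print(tokens, weights)
```

In this toy case the peak at 100.0 falls below the noise threshold and the peak at 65.0 has no vocabulary entry within the mass error, so both are dropped. Spectra left with fewer than `min_num_peaks` tokens are discarded, which is one way a dataset can shrink during cleaning.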

## Generating fingerprints from the SMILES

In [20]:
from deepmol.compound_featurization import MorganFingerprint
from deepmol.datasets import SmilesDataset

spectrum_ids = [spectrum_id for spectrum_id, *_ in processed_spectra]
filtered_smiles = smiles_df[smiles_df['spectrum_id'].isin(spectrum_ids)].copy()

smiles_list = filtered_smiles['canon_smiles'].tolist()
ids_list = filtered_smiles['spectrum_id'].tolist()

dataset = SmilesDataset(smiles=smiles_list, ids=ids_list)
dataset = MorganFingerprint().featurize(dataset)
dataset._y = dataset.X

No normalization for SPS. Feature removed!
No normalization for AvgIpc. Feature removed!
2026-02-14 11:45:37.306118: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2026-02-14 11:45:38.935017: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2026-02-14 11:45:38.935077: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2026-02-14 11:45:39.034494: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has alread

Instructions for updating:
experimental_relax_shapes is deprecated, use reduce_retracing instead


Skipped loading modules with transformers dependency. No module named 'transformers'
cannot import name 'HuggingFaceModel' from 'deepchem.models.torch_models' (/home/cgomes/miniconda3/envs/transformer/lib/python3.11/site-packages/deepchem/models/torch_models/__init__.py)
Skipped loading some Jax models, missing a dependency. No module named 'jax'
MorganFingerprint: 100%|██████████| 80714/80714 [02:28<00:00, 544.82it/s]


## Data splitting - 1

In [14]:
from deepmol.splitters import MultiTaskStratifiedSplitter


split = MultiTaskStratifiedSplitter()

train_dataset, val_dataset, _ = split.train_valid_test_split(dataset, frac_train=0.8,
                                                                 frac_valid=0.2, frac_test=0)

raw_splits = {'train': train_dataset.ids, 'val': val_dataset.ids, 'test': []}


In [20]:
print(f"Training spectra: {len(raw_splits['train'])}")
print(f"Validation spectra: {len(raw_splits['val'])}")

Training spectra: 64371
Validation spectra: 16343


In [21]:
from src.data.split_prep_tools.data_splitting import clean_splits

clean_splits_dict, cleaning_stats = clean_splits(raw_splits, filtered_smiles, remove_train_duplicates=False)

In [None]:
import pickle

import pandas as pd

# Save the cleaned split IDs
split_pkl = 'split_ids.pkl'
with open(split_pkl, 'wb') as f:
    pickle.dump(clean_splits_dict, f)

# Save the fingerprints, one column per bit, together with the spectrum IDs
fp_df = pd.DataFrame(dataset.X, columns=[f'fp_{i}' for i in range(dataset.X.shape[1])])
fp_df['spectrum_id'] = dataset.ids
fingerprints_pkl = 'fingerprints.pkl'
fp_df.to_pickle(fingerprints_pkl)


In [40]:
from pathlib import Path

id_to_label = dict(zip(dataset.ids, dataset._y))

train_labels = np.array([id_to_label[spec_id] for spec_id in clean_splits_dict['train']])
val_labels = np.array([id_to_label[spec_id] for spec_id in clean_splits_dict['val']])
test_labels = np.array([id_to_label[spec_id] for spec_id in clean_splits_dict['test']])

if len(test_labels) == 0 and len(train_labels) > 0:
    n_features = train_labels.shape[1]  # Should be 2048
    test_labels = np.zeros((0, n_features))

stats_df, table_styled = generate_data_stats(train_labels, test_labels, val_labels)

stats_csv_path = 'split_statistics.csv'
stats_html_path = 'split_statistics.html'

with open(stats_html_path, 'w') as f:
    f.write(table_styled.to_html())
    
stats_df.to_csv(stats_csv_path, index=False)

final_train_count = len(clean_splits_dict['train'])
final_val_count = len(clean_splits_dict['val'])
final_test_count = len(clean_splits_dict['test'])

total_final = final_train_count + final_val_count + final_test_count

summary_data = {
        "mode": "augmented_train",
        "split_seed": 2,

        "final_counts": {
            "train": final_train_count,
            "val": final_val_count,
            "test": final_test_count,
            "total": total_final
        },

        "final_fractions": {
            "train": round(final_train_count / total_final, 4),
            "val": round(final_val_count / total_final, 4),
            "test": round(final_test_count / total_final, 4)
        },

        "cleaning_impact": cleaning_stats
    }

summary_path = 'split_summary.json'
with open(summary_path, 'w') as f:
        json.dump(summary_data, f, indent=4)

if Path('src/data/artifacts/2/split_ids.pkl').exists() and Path('src/data/artifacts/2/fingerprints.pkl').exists():
    print('✅ Success! Split IDs and fingerprints saved.')
else:
    print('⚠️ Warning: files not found.')

✅ Success! Split IDs and fingerprints saved.


The splitting used in the normal workflow did not work well here: only 100 examples were left in the validation set.

## Data splitting - 2

In [26]:
from sklearn.model_selection import GroupShuffleSplit
from src.data.split_prep_tools.data_splitting import clean_splits

splitter = GroupShuffleSplit(n_splits=1, train_size=0.8, random_state=1)

train_idx, val_idx = next(splitter.split(ids_list, groups=smiles_list, y=smiles_list))


train_ids = [ids_list[i] for i in train_idx]
val_ids = [ids_list[i] for i in val_idx]


raw_splits = {
    'train': train_ids,
    'val': val_ids,
    'test': []
}

print("Split done!")
print(f"Training spectra before cleaning: {len(train_ids)}")
print(f"Validation spectra before cleaning: {len(val_ids)}")


clean_splits_dict, stats = clean_splits(
    splits=raw_splits,
    smiles_df=filtered_smiles,
    remove_train_duplicates=False
)

Split done!
Training spectra before cleaning: 63540
Validation spectra before cleaning: 17174


In [27]:
print(stats)

{'original_train_count': 63540, 'cleaned_train_count': 63540, 'original_val_count': 17174, 'cleaned_val_count': 725, 'removed_from_val': 16449, 'original_test_count': 0, 'cleaned_test_count': 0, 'removed_from_test': 0}
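The statistics show that 16449 of the 17174 validation spectra were removed, leaving 725. The criterion applied by `clean_splits` is not shown in this notebook. A small diagnostic like the one below can help check how much of the validation set shares compounds with the training set under a chosen key. The function name `count_val_overlap`, the `id_to_key` mapping from spectrum ID to a compound key (for example a canonical SMILES), and the toy data are all hypothetical.

```python
def count_val_overlap(train_ids, val_ids, id_to_key):
    """Count validation spectra whose compound key also occurs in training.

    Returns a tuple (overlapping, non_overlapping).
    """
    train_keys = {id_to_key[i] for i in train_ids}
    overlapping = [i for i in val_ids if id_to_key[i] in train_keys]
    return len(overlapping), len(val_ids) - len(overlapping)


# Toy example: spectra s1 and s2 belong to the same compound.
id_to_key = {"s1": "CCO", "s2": "CCO", "s3": "CCN", "s4": "CCC"}
train = ["s1", "s3"]
val = ["s2", "s4"]

overlap, clean = count_val_overlap(train, val, id_to_key)
print(overlap, clean)
```

Running the same check with different keys (raw SMILES, canonical SMILES, or another compound identifier) can indicate which definition of "same compound" accounts for the removals.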


In [28]:
import pickle

import pandas as pd

# Save the cleaned split IDs
split_pkl = 'split_ids.pkl'
with open(split_pkl, 'wb') as f:
    pickle.dump(clean_splits_dict, f)

# Save the fingerprints, one column per bit, together with the spectrum IDs
fp_df = pd.DataFrame(dataset.X, columns=[f'fp_{i}' for i in range(dataset.X.shape[1])])
fp_df['spectrum_id'] = dataset.ids
fingerprints_pkl = 'fingerprints.pkl'
fp_df.to_pickle(fingerprints_pkl)


In [29]:
from pathlib import Path
import json

id_to_label = dict(zip(dataset.ids, dataset._y))

train_labels = np.array([id_to_label[spec_id] for spec_id in clean_splits_dict['train']])
val_labels = np.array([id_to_label[spec_id] for spec_id in clean_splits_dict['val']])
test_labels = np.array([id_to_label[spec_id] for spec_id in clean_splits_dict['test']])

if len(test_labels) == 0 and len(train_labels) > 0:
    n_features = train_labels.shape[1]
    test_labels = np.zeros((0, n_features))

stats_df, table_styled = generate_data_stats(train_labels, test_labels, val_labels)

stats_csv_path = 'split_statistics.csv'
stats_html_path = 'split_statistics.html'

with open(stats_html_path, 'w') as f:
    f.write(table_styled.to_html())
    
stats_df.to_csv(stats_csv_path, index=False)

final_train_count = len(clean_splits_dict['train'])
final_val_count = len(clean_splits_dict['val'])
final_test_count = len(clean_splits_dict['test'])

total_final = final_train_count + final_val_count + final_test_count

summary_data = {
        "mode": "augmented_train",
        "split_seed": 3,

        "final_counts": {
            "train": final_train_count,
            "val": final_val_count,
            "test": final_test_count,
            "total": total_final
        },

        "final_fractions": {
            "train": round(final_train_count / total_final, 4),
            "val": round(final_val_count / total_final, 4),
            "test": round(final_test_count / total_final, 4)
        },

        "cleaning_impact": stats
    }

summary_path = 'split_summary.json'
with open(summary_path, 'w') as f:
        json.dump(summary_data, f, indent=4)

if Path('src/data/artifacts/3/split_ids.pkl').exists() and Path('src/data/artifacts/3/fingerprints.pkl').exists():
    print('Success! Split IDs and fingerprints saved.')
else:
    print('Warning: files not found.')

Success! Split IDs and fingerprints saved.
