In [8]:
import pandas as pd

In [10]:
df = pd.read_csv("../../data/production_ready_data/train/spectrs/MassSpecGym_fixed.csv")

In [11]:
df

Unnamed: 0,spec,smiles,extracted_spectral_info,InChI,Formula
0,MassSpecGymID0000001,CC(=O)N[C@@H](CC1=CC=CC=C1)C2=CC(=CC(=O)O2)OC,"{""cand_form"": ""C16H17NO4"", ""cand_ion"": ""[M+H]+...",InChI=1S/C16H17NO4/c1-11(18)17-14(8-12-6-4-3-5...,C16H17NO4
1,MassSpecGymID0000002,CC(=O)N[C@@H](CC1=CC=CC=C1)C2=CC(=CC(=O)O2)OC,"{""cand_form"": ""C16H17NO4"", ""cand_ion"": ""[M+H]+...",InChI=1S/C16H17NO4/c1-11(18)17-14(8-12-6-4-3-5...,C16H17NO4
2,MassSpecGymID0000003,CC(=O)N[C@@H](CC1=CC=CC=C1)C2=CC(=CC(=O)O2)OC,"{""cand_form"": ""C16H17NO4"", ""cand_ion"": ""[M+H]+...",InChI=1S/C16H17NO4/c1-11(18)17-14(8-12-6-4-3-5...,C16H17NO4
3,MassSpecGymID0000004,CC(=O)N[C@@H](CC1=CC=CC=C1)C2=CC(=CC(=O)O2)OC,"{""cand_form"": ""C16H17NO4"", ""cand_ion"": ""[M+H]+...",InChI=1S/C16H17NO4/c1-11(18)17-14(8-12-6-4-3-5...,C16H17NO4
4,MassSpecGymID0000005,CC(=O)N[C@@H](CC1=CC=CC=C1)C2=CC(=CC(=O)O2)OC,"{""cand_form"": ""C16H17NO4"", ""cand_ion"": ""[M+H]+...",InChI=1S/C16H17NO4/c1-11(18)17-14(8-12-6-4-3-5...,C16H17NO4
...,...,...,...,...,...
231099,MassSpecGymID0414168,CC[C@@H]1[C@H](/C=C(/C=C\C(=O)[C@@H](C[C@@H]([...,"{""cand_form"": ""C46H77NO17"", ""cand_ion"": ""[M+H]...",InChI=1S/C46H77NO17/c1-13-33-30(22-58-45-42(57...,C46H77NO17
231100,MassSpecGymID0414171,C[C@@]1([C@H]2C[C@H]3[C@@H](C(=O)C(=C([C@]3(C(...,"{""cand_form"": ""C22H23ClN2O8"", ""cand_ion"": ""[M+...",InChI=1S/C22H23ClN2O8/c1-21(32)7-6-8-15(25(2)3...,C22H23ClN2O8
231101,MassSpecGymID0414172,C[C@H]([C@@H]1CC[C@H]([C@H](O1)O[C@@H]2[C@H](C...,"{""cand_form"": ""C21H43N5O7"", ""cand_ion"": ""[M+H]...",InChI=1S/C21H43N5O7/c1-9(25-3)13-6-5-10(22)19(...,C21H43N5O7
231102,MassSpecGymID0414173,C[C@H]([C@@H]1CC[C@H]([C@H](O1)O[C@@H]2[C@H](C...,"{""cand_form"": ""C21H43N5O7"", ""cand_ion"": ""[M+H]...",InChI=1S/C21H43N5O7/c1-9(25-3)13-6-5-10(22)19(...,C21H43N5O7


# What was modified, based on the DiffMS approach
Reference files:
1) /DiffMS/src/datasets/spec2mol_dataset.py
2) /DiffMS/src/mist/data/featurizers.py
3) /DiffMS/src/spec2mol_main.py

# SpectrumProcessor vs PeakFormula Comparison

The updated `SpectrumProcessor` class refactors functionality from the `PeakFormula` class in the original code. The key differences are:

## Purpose and Integration
- **SpectrumProcessor**: Standalone class that processes raw spectral data directly from JSON
- **PeakFormula**: Integrated featurizer within a larger framework that works with `data.Spectra` objects

## Input Processing
- **SpectrumProcessor**: Takes raw JSON strings or dictionaries directly via `process_raw_spectrum`
- **PeakFormula**: Reads spectrum data from files in a predefined directory structure
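
Accepting either a JSON string or an already-parsed dictionary can be sketched as below. This is only an illustration of the input handling described above, not the actual body of `process_raw_spectrum`; the helper name `load_raw_spectrum` is hypothetical, and the example uses only the `cand_form` key, which is fully visible in the DataFrame preview.

```python
import json
from typing import Any, Dict, Union


def load_raw_spectrum(raw: Union[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Return the spectrum as a dict, parsing it first if it is a JSON string.

    Hypothetical helper: the real SpectrumProcessor may normalize input differently.
    """
    if isinstance(raw, str):
        return json.loads(raw)
    if isinstance(raw, dict):
        return raw
    raise TypeError(f"expected str or dict, got {type(raw).__name__}")


# Same record, once as the JSON text stored in the CSV cell, once as a dict.
as_text = '{"cand_form": "C16H17NO4"}'
as_dict = {"cand_form": "C16H17NO4"}

parsed_text = load_raw_spectrum(as_text)
parsed_dict = load_raw_spectrum(as_dict)
```

Normalizing the input once at the entry point keeps the rest of the processing code free of type checks.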

## Architecture
- **SpectrumProcessor**: Self-contained with no inheritance
- **PeakFormula**: Inherits from `SpecFeaturizer` which inherits from `Featurizer`

## Key Implementation Differences

1. **Data Flow**:
   - `SpectrumProcessor` operates on raw data in memory
   - `PeakFormula` requires file system integration and file reading

2. **Method Structure**:
   - `SpectrumProcessor` has cleaner method names (`_get_peak_dict_from_raw`, `process_raw_spectrum`)
   - `PeakFormula` uses framework methods like `_featurize` and `collate_fn`

3. **Output Format**:
   - Both generate similar feature dictionaries but `PeakFormula` has additional metadata for the framework
   - `PeakFormula` includes collation functions for batching data in PyTorch

`SpectrumProcessor` is a focused extraction of the core spectrum-processing functionality, which makes it more portable and independent of the larger framework. It works on raw data directly instead of relying on file system integration.
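
Because `SpectrumProcessor` has no collation function, batching has to be handled by the caller. The sketch below shows what a collation step typically does for variable-length peak lists: pad to a common length and keep a mask. It is a pure-Python stand-in under that assumption, not the `collate_fn` from `PeakFormula`; the name `pad_peak_batch` and the returned keys are hypothetical.

```python
from typing import Dict, List


def pad_peak_batch(batch: List[List[float]], pad_value: float = 0.0) -> Dict[str, list]:
    """Pad each peak list to the longest one and record which entries are real."""
    max_len = max((len(peaks) for peaks in batch), default=0)
    padded = [peaks + [pad_value] * (max_len - len(peaks)) for peaks in batch]
    # mask[i][j] is True where padded[i][j] is a real peak, False where it is padding.
    mask = [[j < len(peaks) for j in range(max_len)] for peaks in batch]
    return {"values": padded, "mask": mask, "lengths": [len(peaks) for peaks in batch]}


# Two spectra with three peaks and one peak respectively.
example = pad_peak_batch([[0.2, 1.0, 0.5], [0.7]])
```

In a PyTorch pipeline the padded lists and the mask would normally be converted to tensors before being passed to the model.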

# FingerprintProcessor vs FingerprintFeaturizer Comparison

The new `FingerprintProcessor` class represents a streamlined version of the original `FingerprintFeaturizer` class with different interfaces and additional capabilities. Here's a detailed comparison:

## Purpose and Integration
- **FingerprintProcessor**: Self-contained class that processes raw spectral JSON data directly to generate fingerprints
- **FingerprintFeaturizer**: Integrated into a larger framework that works with `data.Mol` objects and includes PyTorch data handling

## Input Processing
- **FingerprintProcessor**: Accepts raw JSON with SMILES/InChI and creates molecules directly
- **FingerprintFeaturizer**: Expects pre-processed `data.Mol` objects and has no direct JSON handling

## Key Implementation Differences

1. **Data Flow**:
   - `FingerprintProcessor` extracts SMILES/InChI from JSON and creates RDKit molecules
   - `FingerprintFeaturizer` uses molecules provided by the larger framework

2. **Fingerprint Handling**:
   - Both support multiple fingerprint types (Morgan, MACCS)
   - Implementation of specific fingerprint methods is very similar

3. **New Features in FingerprintProcessor**:
   - Built-in fingerprint augmentation via bit flipping
   - Better error handling with `_empty_fingerprints` for when molecules can't be created

4. **Features in FingerprintFeaturizer not in FingerprintProcessor**:
   - Caching mechanism for fingerprints
   - Loading pre-computed fingerprints from files
   - PyTorch collation function

5. **Architecture**:
   - `FingerprintProcessor` is self-contained with no inheritance
   - `FingerprintFeaturizer` inherits from `MolFeaturizer` which inherits from `Featurizer`

`FingerprintProcessor` is a more independent implementation that works directly with raw data and adds data augmentation and a fallback for molecules that cannot be created. In exchange, it drops framework integration features of the original `FingerprintFeaturizer`, such as caching, loading pre-computed fingerprints, PyTorch batching, and distance calculations.
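
The two additions listed above, bit-flip augmentation and an all-zero fallback, can be sketched in plain Python. This is a minimal illustration under assumed names (`flip_bits`, `empty_fingerprint`, `fingerprint_or_empty`, and `flip_prob` are all hypothetical), not the actual `FingerprintProcessor` code.

```python
import random
from typing import List, Optional


def empty_fingerprint(n_bits: int) -> List[int]:
    """All-zero placeholder used when no molecule could be built."""
    return [0] * n_bits


def flip_bits(fp: List[int], flip_prob: float,
              rng: Optional[random.Random] = None) -> List[int]:
    """Return a copy of `fp` with each bit inverted with probability `flip_prob`."""
    rng = rng or random.Random()
    return [1 - bit if rng.random() < flip_prob else bit for bit in fp]


def fingerprint_or_empty(fp: Optional[List[int]], n_bits: int,
                         flip_prob: float = 0.0, seed: int = 0) -> List[int]:
    """Augment a valid fingerprint, or fall back to zeros if `fp` is None."""
    if fp is None:
        return empty_fingerprint(n_bits)
    return flip_bits(fp, flip_prob, random.Random(seed))


original = [0, 1, 0, 1, 1]
unchanged = flip_bits(original, flip_prob=0.0)  # no bit is ever flipped
inverted = flip_bits(original, flip_prob=1.0)   # every bit is flipped
fallback = fingerprint_or_empty(None, n_bits=4)
```

Passing a seeded `random.Random` keeps the augmentation reproducible, which helps when comparing training runs.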

# GraphProcessor vs GraphFeaturizer Comparison

The `GraphProcessor` class extends the original `GraphFeaturizer` class. Here's a detailed comparison:

## Purpose and Integration
- **GraphProcessor**: Self-contained processor that converts raw spectral JSON directly into molecular graph representations
- **GraphFeaturizer**: Integrated into a larger framework that works with `data.Mol` objects from the larger system

## Input Processing
- **GraphProcessor**: Accepts raw JSON with SMILES/InChI and creates RDKit molecules directly
- **GraphFeaturizer**: Expects pre-processed `data.Mol` objects with molecule information already extracted

## Key Implementation Differences

1. **Graph Construction**:
   - Both construct PyTorch Geometric `Data` objects with similar node and edge representations
   - Similar atom type and bond type encoding approaches with one-hot encoding

2. **New Features in GraphProcessor**:
   - **Extended node features**: Option to add detailed atom properties (degree, charge, hybridization, etc.)
   - **Graph augmentation**: Ability to perturb node and edge features for data augmentation
   - **Error handling**: `_empty_graph()` method to handle cases when molecules can't be created
   - **Direct JSON processing**: Works directly with raw spectral data without requiring pre-processing

3. **Missing in GraphProcessor**:
   - **Batch collation**: `GraphFeaturizer` includes a `collate_fn` for PyTorch DataLoader integration

4. **Code Structure**:
   - `GraphProcessor` has more comprehensive and modular code organization
   - More extensive documentation and better error handling
   - More flexible input handling (SMILES or InChI)

5. **Architecture**:
   - `GraphProcessor` is self-contained with no inheritance
   - `GraphFeaturizer` inherits from `Featurizer` which enables caching behavior

`GraphProcessor` extends the original graph generation with more flexible input handling, a fallback for molecules that cannot be created, and richer feature representation, while working independently of the larger framework.

The main improvement is the ability to work directly with raw spectral data and to provide richer molecular representations through extended node features and augmentation options.
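
The one-hot node and edge encoding mentioned above can be sketched without RDKit or PyTorch Geometric. The vocabularies, the `build_graph` helper, and the use of plain lists instead of tensors are assumptions made for illustration; the real `GraphProcessor` builds PyTorch Geometric `Data` objects and may use different atom and bond sets.

```python
from typing import Dict, List, Tuple

# Hypothetical vocabularies; the real processor may use different ones.
ATOM_TYPES = ["C", "N", "O", "F", "S", "Cl"]
BOND_TYPES = ["SINGLE", "DOUBLE", "TRIPLE", "AROMATIC"]


def one_hot(value: str, vocabulary: List[str]) -> List[int]:
    """One-hot encode `value`; raises ValueError if it is not in the vocabulary."""
    index = vocabulary.index(value)
    return [int(i == index) for i in range(len(vocabulary))]


def build_graph(atoms: List[str], bonds: List[Tuple[int, int, str]]) -> Dict[str, list]:
    """Encode atoms as node features and store every bond in both directions."""
    x = [one_hot(symbol, ATOM_TYPES) for symbol in atoms]
    sources, targets, edge_attr = [], [], []
    for i, j, bond_type in bonds:
        encoded = one_hot(bond_type, BOND_TYPES)
        # An undirected bond becomes two directed edges with the same features.
        for src, dst in ((i, j), (j, i)):
            sources.append(src)
            targets.append(dst)
            edge_attr.append(encoded)
    return {"x": x, "edge_index": [sources, targets], "edge_attr": edge_attr}


# Heavy atoms of formaldehyde: one carbon double-bonded to one oxygen.
graph = build_graph(["C", "O"], [(0, 1, "DOUBLE")])

# With no atoms the same function yields an empty graph structure.
empty = build_graph([], [])
```

Storing each bond twice mirrors the directed `edge_index` layout that PyTorch Geometric expects for undirected graphs.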

In [12]:
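# Import only the three processors used in this notebook
from processors.processors_diffms import (
    SpectrumProcessor,
    FingerprintProcessor,
    GraphProcessor,
)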

In [14]:
# Initialize the processor
processor = SpectrumProcessor(
    max_peaks=50,
    inten_transform="float",
    cls_type="ms1",
    magma_aux_loss=True,
    magma_folder='../../data/raw/msg_diffms/magma_outputs/magma_tsv'
)

# For fingerprints
fingerprint_processor = FingerprintProcessor(
    fp_names=["morgan2048", "maccs"],
    augment_data=False
)

# For graphs
graph_processor = GraphProcessor(
    augment_data=False,
    add_node_features=True
)

# Container for the per-row results
processed_rows = []

Found 231104 MAGMA files


In [None]:
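from tqdm.notebook import tqdm

# Smoke test: run all three processors on the first row only.
for idx, row in tqdm(df.iterrows(), total=df.shape[0], desc="Processing spectra"):
    # Fingerprints are built from the full row (SMILES/InChI columns).
    fp_features = fingerprint_processor.process_raw_spectrum(
        raw_spectral_json=row,
        spec_id=row['spec'],
        train_mode=False
    )
    print(fp_features)

    # Spectrum features are built from the JSON string in `extracted_spectral_info`.
    spec_features = processor.process_raw_spectrum(
        raw_spectral_json=row['extracted_spectral_info'],
        spec_id=row['spec'],
        train_mode=False
    )
    print(spec_features)

    # The molecular graph is built from the full row, like the fingerprints.
    graph_data = graph_processor.process_raw_spectrum(
        raw_spectral_json=row,
        spec_id=row['spec'],
        train_mode=False
    )
    # print(graph_data)

    break  # stop after the first row; remove to process the whole DataFrame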

Processing spectra:   0%|          | 0/231104 [00:00<?, ?it/s]

{'fingerprints': {'morgan2048': array([0, 1, 0, ..., 0, 0, 0], dtype=int8), 'maccs': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0,
       1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0], dtype=int8)}, 'name': 'MassSpecGymID0000001', 'smiles': 'CC(=O)N[C@@H](CC1=CC=CC=C1)C2=CC(=CC(=O)O2)OC', 'inchi': 'InChI=1S/C16H17NO4/c1-11(18)17-14(8-12-6-4-3-5-7-12)15-9-13(20-2)10-16(19)21-15/h3-7,9-10,14H,8H2,1-2H3,(H,17,18)/t14-/m0/s1'}
{'peak_type': array([0, 0, 0, 0, 0, 0, 0, 0, 3]), 'form_vec': array([[ 6.,  4.,  0.,  0.,  0.,  0.,  0.,  0.,  0., 

