
#### Understanding the data_processing Section

The data_processing section of your YAML configuration defines how molecular data is featurized for use in your machine learning models. This section specifies which atomic and bond features are extracted from molecular structures and how they are processed.

Here is the data_processing section from the YAML file:

```yaml
data_processing:
  atom_config:
    feature_attributes:
      atom_symbol:
        include_other: true
        top_n_atoms: 42
    features:
      aromatic: true
      atom_symbol: true
      default_valence: true
      formal_charge: true
      hybridization: true
      hydrogen_count: true
      ring_size: true
      total_valence: true
  bond_config:
    features:
      bond_type: true
      conjugated: true
      in_ring: true
      stereochemistry: false
```

Overview

	•	atom_config: Specifies the configuration for atomic features.
	•	bond_config: Specifies the configuration for bond features.

These configurations are used by the MoleculeFeaturizer class to generate graph representations of molecules, where nodes represent atoms and edges represent bonds.

#### Atom Features (atom_config)

Feature Attributes

The feature_attributes subsection allows you to set specific parameters for certain features. In this case:
```yaml
feature_attributes:
  atom_symbol:
    include_other: true
    top_n_atoms: 42
```
	•	atom_symbol:
		•	include_other: If true, includes an “Unknown” category for atom symbols not in the top N atoms.
		•	top_n_atoms: Specifies the number of most common atom types to consider.
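
The interaction between top_n_atoms and include_other can be sketched in plain Python. This is a minimal illustration, not the mol2dreams implementation: the helper names `build_symbol_vocab` and `one_hot_symbol` are hypothetical, and it assumes the vocabulary is chosen by symbol frequency, which the configuration does not state explicitly.

```python
from collections import Counter


def build_symbol_vocab(symbols, top_n_atoms):
    """Return the top_n_atoms most frequent symbols, most common first."""
    counts = Counter(symbols)
    return [symbol for symbol, _ in counts.most_common(top_n_atoms)]


def one_hot_symbol(symbol, vocab, include_other=True):
    """One-hot encode a symbol, with an optional trailing catch-all slot."""
    size = len(vocab) + (1 if include_other else 0)
    encoding = [0] * size
    if symbol in vocab:
        encoding[vocab.index(symbol)] = 1
    elif include_other:
        encoding[-1] = 1  # symbol outside the vocabulary goes to the last slot
    return encoding


# Toy corpus of atom symbols (illustrative only).
corpus = ["C", "C", "C", "O", "O", "N", "S"]
vocab = build_symbol_vocab(corpus, top_n_atoms=2)  # ["C", "O"]

in_vocab = one_hot_symbol("O", vocab)      # [0, 1, 0]
out_of_vocab = one_hot_symbol("N", vocab)  # [0, 0, 1]
```

With include_other set to false, an out-of-vocabulary symbol in this sketch would encode as all zeros, so the model could not tell rare atoms apart from missing data. The extra slot avoids that at the cost of one additional dimension.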

Features

The features subsection lists boolean flags indicating which features to extract for each atom:
```yaml
features:
  aromatic: true
  atom_symbol: true
  default_valence: true
  formal_charge: true
  hybridization: true
  hydrogen_count: true
  ring_size: true
  total_valence: true
```
	•	aromatic: Indicates whether the atom is part of an aromatic ring.
	•	atom_symbol: One-hot encodes the atom’s chemical symbol (e.g., C, N, O).
	•	default_valence: One-hot encodes the default valence of the atom.
	•	formal_charge: One-hot encodes the formal charge of the atom.
	•	hybridization: One-hot encodes the hybridization state (e.g., sp3, sp2).
	•	hydrogen_count: One-hot encodes the number of hydrogen atoms attached.
	•	ring_size: Indicates the sizes of rings the atom is part of (e.g., 3-membered, 5-membered).
	•	total_valence: One-hot encodes the total valence of the atom.
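
To show how boolean flags like these can drive featurization, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: `ToyAtom` stands in for an RDKit atom, the `ENCODERS` table and `featurize_atom` are hypothetical names, and the category lists are arbitrary. The real AtomFeaturizer may be organized differently.

```python
from dataclasses import dataclass


@dataclass
class ToyAtom:
    """Stand-in for an RDKit atom; hypothetical, for illustration only."""
    symbol: str
    is_aromatic: bool
    formal_charge: int
    hydrogen_count: int


def one_hot(value, choices):
    """Return a one-hot list marking the position of value in choices."""
    return [1 if value == choice else 0 for choice in choices]


# Hypothetical encoders keyed by the flag names used in the YAML.
ENCODERS = {
    "aromatic": lambda atom: [int(atom.is_aromatic)],
    "atom_symbol": lambda atom: one_hot(atom.symbol, ["C", "N", "O"]),
    "formal_charge": lambda atom: one_hot(atom.formal_charge, [-1, 0, 1]),
    "hydrogen_count": lambda atom: one_hot(atom.hydrogen_count, [0, 1, 2, 3, 4]),
}


def featurize_atom(atom, feature_flags):
    """Concatenate the encodings of every feature whose flag is true."""
    vector = []
    # Iterate over ENCODERS, not the flags, so the layout is stable
    # regardless of the key order in the YAML file.
    for name, encoder in ENCODERS.items():
        if feature_flags.get(name, False):
            vector.extend(encoder(atom))
    return vector


flags = {"aromatic": True, "atom_symbol": True,
         "formal_charge": False, "hydrogen_count": True}
atom = ToyAtom(symbol="N", is_aromatic=False, formal_charge=0, hydrogen_count=2)
features = featurize_atom(atom, flags)  # 1 + 3 + 5 = 9 values
```

Note that toggling a flag changes the length of the node feature vector, so a model trained with one set of flags generally cannot be reused with another without adjusting its input dimension.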

#### Bond Features (bond_config)

Features
```yaml
features:
  bond_type: true
  conjugated: true
  in_ring: true
  stereochemistry: false
```
	•	bond_type: One-hot encodes the bond type (e.g., single, double, triple, aromatic).
	•	conjugated: Indicates whether the bond is conjugated.
	•	in_ring: Indicates whether the bond is part of a ring.
	•	stereochemistry: One-hot encodes the stereochemistry of the bond (e.g., cis/trans). Set to false in the example above, so it is not extracted.

#### How It Works in the Code

The AtomFeaturizer and BondFeaturizer classes read the configurations defined in the YAML file to decide which features to extract from each molecule.

MoleculeFeaturizer Class

	•	Location: mol2dreams/featurizer/featurize.py
	•	Purpose: Converts RDKit molecule objects into graph representations suitable for graph neural networks (GNNs) by featurizing atoms and bonds based on the configurations.

AtomFeaturizer Class

	•	Location: mol2dreams/featurizer/atom_features.py
	•	Purpose: Extracts atomic features based on the provided configuration.


BondFeaturizer Class

	•	Purpose: Like AtomFeaturizer, extracts bond features based on the provided configuration.

Adding New Features

If you wish to add new atomic or bond features, you can do so by:

	1.	Modifying the Featurizer Classes:
		•	Atom Features: Add your new feature extraction logic in atom_features.py within the AtomFeaturizer class.
		•	Bond Features: Add your new feature extraction logic in bond_features.py within the BondFeaturizer class.
	2.	Updating the Configuration:
		•	Add a new entry in the features section of atom_config or bond_config in your YAML configuration.
		•	Set the flag to true to include the new feature.

Example: Adding an Atomic Mass Feature

	1.	Modify AtomFeaturizer:

```python
import torch


class AtomFeaturizer:
    # ...

    def featurize(self, atom):
        features = []

        # Existing features...

        # New feature: atomic mass as a single float value
        # (atom is an RDKit Atom, so GetMass() returns its atomic mass)
        if self.config.get('atomic_mass', False):
            mass = atom.GetMass()
            mass_tensor = torch.tensor([mass], dtype=torch.float)
            features.append(mass_tensor)

        # Concatenate features...
```


	2.	Update YAML Configuration:
```yaml
data_processing:
  atom_config:
    # ...
    features:
      # Existing features...
      atomic_mass: true
```


Graph Representation of Molecules

	•	Nodes: Represent atoms with their associated features.
	•	Edges: Represent bonds with their associated features.
	•	Graph Data: The MoleculeFeaturizer converts molecules into torch_geometric.data.Data objects, suitable for GNN models.
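
The node and edge layout described above can be sketched without any dependencies. In torch_geometric, `edge_index` has shape [2, num_edges], and an undirected graph is usually stored as two directed edges per bond. Whether MoleculeFeaturizer stores both directions is an assumption here, and `build_edge_lists` is a hypothetical helper name.

```python
def build_edge_lists(bonds, bond_features):
    """Turn undirected bonds into directed edge lists.

    bonds: list of (i, j) atom index pairs.
    bond_features: one feature vector per bond, in the same order.
    """
    sources, targets, edge_attr = [], [], []
    for (i, j), features in zip(bonds, bond_features):
        # Store each undirected bond as two directed edges: i->j and j->i.
        sources += [i, j]
        targets += [j, i]
        # Both directions share the same bond features.
        edge_attr += [features, features]
    return [sources, targets], edge_attr


# Heavy atoms of ethanol: C0-C1-O2, two single bonds.
# Toy bond encoding [single, double]; illustrative only.
edge_index, edge_attr = build_edge_lists(
    bonds=[(0, 1), (1, 2)],
    bond_features=[[1, 0], [1, 0]],
)
# edge_index == [[0, 1, 1, 2], [1, 0, 2, 1]]
```

Lists in this form can be converted with `torch.tensor(edge_index, dtype=torch.long)` and `torch.tensor(edge_attr, dtype=torch.float)`, then passed to `torch_geometric.data.Data(x=..., edge_index=..., edge_attr=...)` together with the stacked atom features.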

Dataset Preparation

The dataset preparation, including molecule parsing and embedding assignment, is handled in the massspecgym_dreams_embedding.ipynb notebook.

	•	SMILES Parsing: Molecules are parsed from SMILES strings using RDKit.
	•	Featurization: The MoleculeFeaturizer class featurizes each molecule according to your configuration.
	•	Graph Data: The featurized data is stored as a list of graph data objects, ready for training.
