# Usage of the preprocessor classes

The `Preprocessor` abstract class defines a number of abstract methods for transforming and encoding graph-based inputs. Child classes have been implemented for transforming RDKit molecules and pymatgen structures.

In [19]:
from nfp.preprocessing import SmilesPreprocessor
preprocessor = SmilesPreprocessor(explicit_hs=False)

The default `MolPreprocessor` and `SmilesPreprocessor` classes return three arrays: featurized representations of the graph nodes (atoms) and edges (bonds), and a connectivity array.

In [21]:
preprocessor('CCO', train=True)

{'atom': array([2, 4, 3], dtype=int32),
 'bond': array([3, 3, 2, 2], dtype=int32),
 'connectivity': array([[0, 1],
        [1, 0],
        [1, 2],
        [2, 1]])}
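
In the output above, each bond appears twice in `connectivity`, once per direction, which is why `bond` holds four entries for the two bonds of ethanol. The sketch below reproduces that layout in plain Python. The helper name `directed_connectivity` is hypothetical and is not part of nfp; the edge ordering is an assumption chosen to match the output shown here.

```python
def directed_connectivity(bonds):
    """Expand undirected (i, j) atom-index pairs into directed edges.

    Hypothetical helper for illustration only; not the nfp implementation.
    """
    edges = []
    for i, j in bonds:
        edges.append([i, j])  # edge from atom i to atom j
        edges.append([j, i])  # the same bond in the reverse direction
    return edges


# Ethanol (CCO) without explicit hydrogens: bonds C0-C1 and C1-O2
ethanol_edges = directed_connectivity([(0, 1), (1, 2)])
print(ethanol_edges)
```

Each row is a `[source, target]` pair of atom indices, so row `k` of the connectivity array lines up with entry `k` of the bond array.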

Here, the integer classes are assigned based on underlying `Tokenizer` classes.

In [22]:
preprocessor.atom_tokenizer._data

{'unk': 1,
 "('C', 1, 3, 3, False)": 2,
 "('O', 1, 1, 1, False)": 3,
 "('C', 2, 2, 2, False)": 4}

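As a rough illustration of how such a mapping can be built, the sketch below assigns a new integer to each unseen feature string while in training mode and falls back to the `unk` class otherwise. The class `SimpleTokenizer` and its `train` attribute are hypothetical stand-ins written for this example; the actual nfp `Tokenizer` may differ in its details.

```python
class SimpleTokenizer:
    """Hypothetical stand-in for a tokenizer; not the nfp implementation."""

    def __init__(self):
        self._data = {"unk": 1}  # class 1 is reserved for unknown items
        self.num_classes = 1
        self.train = True

    def __call__(self, item):
        key = str(item)
        if key in self._data:
            return self._data[key]
        if self.train:
            # Training mode: register the new item under the next free integer
            self.num_classes += 1
            self._data[key] = self.num_classes
            return self.num_classes
        # Inference mode: unseen items map to the 'unk' class
        return self._data["unk"]


tokenizer = SimpleTokenizer()
train_ids = [tokenizer(f) for f in [("C", 1, 3), ("C", 2, 2), ("O", 1, 1)]]

tokenizer.train = False
unseen_id = tokenizer(("N", 3, 0))  # never seen during training
```

In this sketch, classes are numbered in the order the items are first seen, so the integers depend on which inputs were processed in training mode.
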
The functions used to 'featurize' atoms and bonds 

In [None]:
[preprocessor(smiles) for smiles in ['Cc1ccco1', 'C1CCOC1', 'Cc1ccccc1C']]

In [None]:
'C1CCNC1'