# Introduction

Extracting features from molecules is a common task in machine learning. There are five types of features: 0D, 1D, 2D, 3D, and 4D.

- 0D features are descriptors that summarize the constituent parts of the molecule as a whole, such as the number of atoms, bond counts or the molecular weight.
- 1D features are descriptors that describe substructures in the molecule (e.g. molecular fingerprints).
- 2D features are descriptors that describe the molecular topology based on the graph representation of the molecules, e.g. the number of rings or the number of rotatable bonds.
- 3D features are geometrical descriptors that describe the molecule as a 3D structure.
- 4D features are descriptors that describe the molecule as a 4D structure. A new dimension is added to characterize the interactions between the molecule and the active site of a receptor or the multiple conformational states of the molecule, e.g. the molecular dynamics of the molecule.


![features_image.png](features_image.png)

Source: Molecular Descriptors for Structure–Activity Applications: A Hands-On Approach.

As we increase the level of information about a molecule (from 0D to 4D), we also increase the computational cost of calculating the features. For example, calculating 3D features requires the generation of 3D conformers, which can be computationally expensive for large molecules. In addition, some features may not be available for certain molecules, e.g. 3D features cannot be calculated for molecules that do not have a 3D structure. Fortunately, DeepMol provides methods for generating compound 3D structures.
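To make the cheapest end of this scale concrete, here is a minimal sketch of 0D descriptors computed from a molecular formula alone. It is not DeepMol code: the helper names (`parse_formula`, `zero_d_descriptors`) and the small table of atomic weights are assumptions made for illustration, and the parser only handles simple formulas without brackets.

```python
import re

# Approximate standard atomic weights (g/mol) for a few common elements.
# This table is an illustrative assumption, not taken from DeepMol.
ATOMIC_WEIGHTS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}


def parse_formula(formula):
    """Return {element: count} for a simple formula such as 'C2H6O'."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts


def zero_d_descriptors(formula):
    """Compute a few 0D descriptors that need no connectivity information."""
    counts = parse_formula(formula)
    return {
        "n_atoms": sum(counts.values()),
        "n_heavy_atoms": sum(n for el, n in counts.items() if el != "H"),
        "mol_weight": sum(ATOMIC_WEIGHTS[el] * n for el, n in counts.items()),
    }


ethanol = zero_d_descriptors("C2H6O")
print(ethanol)
```

No graph, conformer or simulation is needed here, which is why 0D descriptors are the cheapest to compute.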

# Generating features using DeepMol

DeepMol provides a number of featurization methods for generating features from molecules. These features can be used for a variety of tasks, such as virtual screening, drug design, and toxicity prediction. The featurization methods are implemented as classes in the deepmol.compound_featurization module. Each class has a featurize method that takes a dataset as input and returns a featurized dataset. The featurize method can be called directly on a dataset object or used in a pipeline with other featurization methods.

The following featurization methods are currently available in DeepMol:
- MorganFingerprint
- AtomPairFingerprint
- LayeredFingerprint
- RDKFingerprint
- MACCSkeysFingerprint
- TwoDimensionDescriptors
- WeaveFeat
- CoulombFeat
- CoulombEigFeat
- ConvMolFeat
- MolGraphConvFeat
- SmileImageFeat
- SmilesSeqFeat
- MolGanFeat
- All3DDescriptors

# Import packages

In [2]:
from deepmol.loaders import CSVLoader, SDFLoader
from deepmol.compound_featurization import MorganFingerprint, TwoDimensionDescriptors, MACCSkeysFingerprint, \
    AtomPairFingerprint, LayeredFingerprint, RDKFingerprint

from deepmol.compound_featurization import WeaveFeat, CoulombFeat, CoulombEigFeat, ConvMolFeat, MolGraphConvFeat, \
        SmileImageFeat, SmilesSeqFeat, MolGanFeat, All3DDescriptors, generate_conformers_to_sdf_file

import numpy as np


# Load the dataset

In [4]:
dataset = CSVLoader("../data/CHEMBL217_reduced.csv", id_field="Original_Entry_ID",
                    smiles_field="SMILES", labels_fields=["Activity_Flag"]).create_dataset()

2023-05-26 16:48:27,972 — ERROR — Molecule with smiles: ClC1=C(N2CCN(O)(CC2)=C/C=C/CNC(=O)C=3C=CC(=CC3)C4=NC=CC=C4)C=CC=C1Cl removed from dataset.
2023-05-26 16:48:27,983 — INFO — Assuming classification since there are less than 10 unique y values. If otherwise, explicitly set the mode to 'regression'!


[16:48:27] Explicit valence for atom # 6 N, 5, is greater than permitted


# Featurize the dataset

## 1D features: fingerprints and structural keys

![fingerprints.png](fingerprints.png)

There are special codes called "structural keys" that have been created for various purposes in the field of chemistry.

They help with tasks like finding similar molecules or exploring different chemical structures. One specific type of structural keys is called the Molecular ACCess System (MACCS) keys.

These keys use binary digits (bits) to show whether certain parts of a molecule are present or not. For example, if a specific structural fragment exists in a molecule, the corresponding bit will be set to 1, and if it's not present, the bit will be set to 0. There are different versions of MACCS keys, but the most common ones are either 166 or 960 bits long. These keys provide a simplified representation of molecules, which is useful for various chemical analyses and comparisons. [1]

Hashed fingerprints are a type of chemical fingerprint that use a special function to convert patterns in molecules into a series of bits. The length of the fingerprint can be predetermined.
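The idea of hashing patterns into a fixed number of bits can be sketched with a toy path-based fingerprint. This is a simplified illustration, not the algorithm used by RDKit or DeepMol: molecules are given by hand as lists of atom symbols and bonds (hydrogens omitted), and the function names and the choice of SHA-256 as hash function are assumptions.

```python
import hashlib


def enumerate_paths(atoms, bonds, max_len=3):
    """Yield a canonical string for every simple path with up to max_len atoms."""
    neighbors = {i: set() for i in range(len(atoms))}
    for a, b in bonds:
        neighbors[a].add(b)
        neighbors[b].add(a)

    def walk(path):
        symbols = tuple(atoms[i] for i in path)
        # A path and its reverse are the same fragment: keep one canonical form.
        yield "-".join(min(symbols, symbols[::-1]))
        if len(path) < max_len:
            for nxt in neighbors[path[-1]]:
                if nxt not in path:
                    yield from walk(path + [nxt])

    for start in range(len(atoms)):
        yield from walk([start])


def hashed_fingerprint(atoms, bonds, n_bits=64, max_len=3):
    """Hash every fragment to a bit position in a vector of predetermined length."""
    bits = [0] * n_bits
    for fragment in enumerate_paths(atoms, bonds, max_len):
        digest = hashlib.sha256(fragment.encode()).digest()
        bits[int.from_bytes(digest[:4], "big") % n_bits] = 1
    return bits


# Ethanol (C-C-O) and propane (C-C-C), hydrogens omitted.
ethanol = hashed_fingerprint(["C", "C", "O"], [(0, 1), (1, 2)])
propane = hashed_fingerprint(["C", "C", "C"], [(0, 1), (1, 2)])
print(sum(ethanol), sum(propane))
```

Because different fragments can hash to the same position, a set bit does not identify a unique fragment. Longer fingerprints reduce such collisions at the cost of more memory.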

There are different types of fingerprints used in chemistry. Topological or path-based fingerprints, like Daylight fingerprints, provide information about how atoms are connected in a molecule. Circular fingerprints, such as ECFP, give information about the neighborhoods of atoms. These fingerprints are useful for quickly comparing similarities between molecules, studying the relationship between chemical structures and activities, and creating maps of chemical space.
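Similarity between two binary fingerprints is commonly measured with the Tanimoto coefficient, the number of bits set in both divided by the number of bits set in either. The text above does not name a specific metric, so the choice of Tanimoto and the function below are assumptions made for illustration.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two equal-length binary fingerprints."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for a, b in zip(fp_a, fp_b) if a and b)
    either = sum(1 for a, b in zip(fp_a, fp_b) if a or b)
    # Convention chosen here: two empty fingerprints are treated as identical.
    return both / either if either else 1.0


fp_1 = [1, 1, 0, 1, 0, 0, 1, 0]
fp_2 = [1, 0, 0, 1, 0, 1, 1, 0]
similarity = tanimoto(fp_1, fp_2)
print(similarity)  # 3 shared bits out of 5 set in either fingerprint
```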

Most fingerprints have been designed for small molecules and may not work well with larger ones. For example, ECFP4 is effective for virtual screening [2] and target prediction [3] with small molecules but may not accurately represent the overall features or structural differences of larger molecules [4].

On the other hand, atom-pair fingerprints, which describe molecular shape, are better suited for larger molecules [4]. However, they don't provide detailed structural information and may perform poorly in benchmarking studies with small molecules compared to substructure fingerprints like ECFP4 [4].


[1] L. David et al. “Molecular representations in AI-driven drug discovery: a review and practical guide”. In: Journal of cheminformatics 12.1 (2020), p. 56

[2] S. Riniker and G. A. Landrum. “Open-source platform to benchmark fingerprints for ligand-based virtual screening”. In: Journal of cheminformatics 5.1 (2013), pp. 1–17

[3] M. Awale and J.-L. Reymond. “Polypharmacology browser PPB2: target prediction combining nearest neighbors with machine learning”. In: Journal of chemical information and modeling 59.1 (2018), pp. 10–17

[4] A. Capecchi, D. Probst, and J.-L. Reymond. “One molecular fingerprint to rule them all: drugs, biomolecules, and the metabolome”. In: Journal of cheminformatics 12.1 (2020), pp. 1–15


## Generate fingerprints and fragment keys with DeepMol

## Morgan Fingerprint

Morgan fingerprints, also known as circular fingerprints, are a type of molecular fingerprint that encodes the structural information of a molecule as a binary bit vector.

These fingerprints are generated using the Morgan algorithm, which starts from an identifier for each atom and iteratively updates it with the identifiers of the neighboring atoms, so that each iteration captures a circular neighborhood one bond larger. The resulting bit vector records the presence or absence of the substructures found within a certain radius of each atom.
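The iterative neighborhood update can be sketched in a few lines. This is a deliberately simplified toy, not the ECFP implementation in RDKit: atoms are described only by element symbol and degree, bond orders are ignored, duplicate environments are not pruned, and the function names are assumptions.

```python
import hashlib


def stable_hash(value):
    """Deterministic integer hash (the built-in hash() is salted for strings)."""
    return int.from_bytes(hashlib.sha256(repr(value).encode()).digest()[:4], "big")


def toy_morgan_fingerprint(atoms, bonds, radius=2, n_bits=64):
    neighbors = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)

    # Iteration 0: each identifier depends only on the atom itself.
    identifiers = [
        stable_hash((symbol, len(neighbors[i]))) for i, symbol in enumerate(atoms)
    ]
    environments = set(identifiers)

    # Each further iteration folds in the neighbors' identifiers, so the
    # identifiers describe neighborhoods that are one bond larger.
    for _ in range(radius):
        identifiers = [
            stable_hash(
                (identifiers[i], tuple(sorted(identifiers[j] for j in neighbors[i])))
            )
            for i in range(len(atoms))
        ]
        environments.update(identifiers)

    # Fold the collected environment identifiers into a fixed-length bit vector.
    bits = [0] * n_bits
    for env in environments:
        bits[env % n_bits] = 1
    return bits, environments


# Ethanol heavy atoms: C-C-O
bits, environments = toy_morgan_fingerprint(["C", "C", "O"], [(0, 1), (1, 2)])
print(sum(bits), len(environments))
```

Sorting the neighbor identifiers before hashing makes the result independent of the order in which atoms are listed.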

Morgan fingerprints are widely used in cheminformatics and computational chemistry for tasks such as molecular similarity analysis, virtual screening, and quantitative structure-activity relationship (QSAR) modeling. They are also computationally efficient and can be generated quickly for large sets of molecules, making them useful for high-throughput screening applications.

In [26]:
MorganFingerprint(n_jobs=10).featurize(dataset, inplace=True)

In [27]:
dataset.X.shape

(16645, 2048)

In [28]:
dataset.X[0]

array([0., 1., 0., ..., 0., 0., 0.], dtype=float32)

In [29]:
np.unique(dataset.X[0], return_counts=True)

(array([0., 1.], dtype=float32), array([2006,   42]))

## Atom Pair Fingerprint

Atom pair fingerprints are a type of molecular fingerprint used in cheminformatics and computational chemistry. They encode pairs of atoms in a molecule together with the topological distance between them.

The method enumerates every pair of atoms in the molecule, describes each pair by the types of the two atoms and the length of the shortest path between them, and counts how often each pair occurs. In the binary version used here, the pairs are hashed into a bit vector of fixed length that records the presence or absence of each atom pair, which facilitates comparison and analysis.
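A minimal sketch of the atom pair idea, assuming a connected molecule given by hand as atom symbols and bonds. The atom type is reduced to the element symbol here, which is coarser than the atom typing of real atom pair fingerprints, and the function names are hypothetical.

```python
from collections import Counter, deque


def graph_distances(n_atoms, bonds):
    """All-pairs shortest path lengths (in bonds) via breadth-first search."""
    neighbors = {i: [] for i in range(n_atoms)}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)
    distances = {}
    for start in range(n_atoms):
        seen = {start: 0}
        queue = deque([start])
        while queue:
            current = queue.popleft()
            for nxt in neighbors[current]:
                if nxt not in seen:
                    seen[nxt] = seen[current] + 1
                    queue.append(nxt)
        distances[start] = seen
    return distances


def atom_pairs(atoms, bonds):
    """Count (atom type, distance, atom type) triples; assumes a connected graph."""
    distances = graph_distances(len(atoms), bonds)
    pairs = Counter()
    for i in range(len(atoms)):
        for j in range(i + 1, len(atoms)):
            first, second = sorted((atoms[i], atoms[j]))
            pairs[(first, distances[i][j], second)] += 1
    return pairs


# Ethanol heavy atoms: C-C-O
pairs = atom_pairs(["C", "C", "O"], [(0, 1), (1, 2)])
print(pairs)
```

Hashing each triple to a bit position would turn these counts into a fixed-length binary fingerprint.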

In [30]:
AtomPairFingerprint(n_jobs=10).featurize(dataset, inplace=True)

In [31]:
dataset.X.shape

(16645, 2048)

In [32]:
dataset.X[0]

array([0., 0., 0., ..., 0., 0., 0.], dtype=float32)

In [33]:
np.unique(dataset.X[0], return_counts=True)

(array([0., 1.], dtype=float32), array([1841,  207]))

## Layered Fingerprint

Layered fingerprints are a type of substructure fingerprint implemented in RDKit and used in cheminformatics and computational chemistry. They encode the presence or absence of subgraphs of a molecule as a binary bit vector.

The "layers" are different levels of detail used to describe each subgraph: pure topology, bond orders, atom types, presence of rings, ring sizes and aromaticity. Each subgraph is hashed once per layer and the corresponding bits are set in the same bit vector, so the final fingerprint combines the information from all layers.

In [34]:
LayeredFingerprint(n_jobs=10).featurize(dataset, inplace=True)

In [35]:
dataset.X.shape

(16645, 2048)

In [36]:
dataset.X[0]

array([0., 0., 0., ..., 0., 0., 0.], dtype=float32)

In [37]:
np.unique(dataset.X[0], return_counts=True)

(array([0., 1.], dtype=float32), array([1485,  563]))

## RDK Fingerprint

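The RDKit fingerprint is a Daylight-like topological fingerprint: it enumerates paths of bonds up to a maximum length in the molecule and hashes each of them into a bit vector of fixed length.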

In [38]:
RDKFingerprint(n_jobs=10).featurize(dataset, inplace=True)

In [39]:
dataset.X.shape

(16645, 2048)

In [40]:
dataset.X[0]

array([1., 0., 1., ..., 0., 1., 1.], dtype=float32)

In [41]:
np.unique(dataset.X[0], return_counts=True)

(array([0., 1.], dtype=float32), array([1255,  793]))

## MACCS Keys Fingerprint

MACCS (Molecular ACCess System) keys are a type of binary molecular fingerprint used in cheminformatics and computational chemistry. They were developed by Molecular Design Limited (MDL) and are widely used in the field.

The MACCS keys encode the presence or absence of certain molecular fragments or substructures in a molecule as a binary bitstring. The fragments used are based on a predefined set of SMARTS patterns, which represent specific substructures or features of a molecule.

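The MACCS keys consist of 166 bit positions, with each bit representing the presence or absence of a specific fragment in the molecule. RDKit's implementation returns a vector of 167 bits, because bit 0 is unused and only keeps the bit indices aligned with the key numbers, which is why the feature matrix below has 167 columns. The bit vector can be used to compare the similarity of two molecules or to search a large database of molecules for compounds with similar structures or properties.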

In [46]:
MACCSkeysFingerprint(n_jobs=10).featurize(dataset, inplace=True)

In [47]:
dataset.X.shape

(16645, 167)

In [48]:
dataset.X[0]

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 1., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 1., 0.,
       1., 0., 1., 0., 0., 0., 0., 1., 1., 0., 0., 1., 1., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 1., 0., 0., 1., 0., 0., 1., 0., 0., 0., 1., 0.,
       0., 0., 1., 1., 0., 0., 1., 1., 1., 0., 0., 0., 1., 0., 1., 1., 1.,
       0., 1., 0., 1., 0., 0., 1., 1., 1., 1., 1., 0., 1., 0., 1., 1., 1.,
       0., 0., 0., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0.],
      dtype=float32)

In [49]:
np.unique(dataset.X[0], return_counts=True)

(array([0., 1.], dtype=float32), array([120,  47]))

## 0D and 2D Descriptors

DeepMol provides all 0D and 2D descriptors, and some 1D descriptors, from RDKit in a single class. These include:
- **EState index descriptors**: The EState indices are calculated based on a set of predefined atomic parameters, such as electronegativity, atomic polarizability, and resonance effects. These indices quantify the electronic characteristics of individual atoms in a molecule.
    - **MaxAbsEStateIndex (MAEstate)**: Maximum absolute EState index - The MAEstate specifically represents the highest absolute EState index value among all the atoms in a molecule. It indicates the atom with the largest charge magnitude, reflecting its potential reactivity or contribution to chemical properties.
    - **MaxEStateIndex**: Maximum EState Index - The MaxEStateIndex specifically represents the highest EState index value among all the atoms in a molecule. It indicates the atom with the largest charge or electronic density, reflecting its potential reactivity or significance in the molecule's properties.
    - **MinAbsEStateIndex**: Minimum absolute EState index - The MinAbsEStateIndex specifically represents the lowest absolute EState index value among all the atoms in a molecule.
    - **MinEStateIndex**: Minimum EState index - The MinEStateIndex represents the lowest EState index value among all the atoms in a molecule.
- **QED**: quantitative estimate of drug-likeness - a score between 0 and 1 that combines desirability functions for eight molecular properties: molecular weight, logP, number of hydrogen bond donors, number of hydrogen bond acceptors, polar surface area, number of rotatable bonds, number of aromatic rings and number of structural alerts. Higher values indicate that the molecule is more similar to known drugs.
- **Molecular weight descriptors**:
    - **MolWt**: Molecular weight. 
    - **HeavyAtomMolWt**: The average molecular weight of the molecule ignoring hydrogens.
    - **ExactMolWt**: The exact molecular weight of the molecule.
- **Electron descriptors**:
    - **NumValenceElectrons**: The number of valence electrons the molecule has.
    - **NumRadicalElectrons**: The number of radical electrons the molecule has.
- **Charge descriptors**:
    - **MaxPartialCharge**: Maximum partial charge.
    - **MinPartialCharge**: Minimum partial charge.
    - **MaxAbsPartialCharge**: Maximum absolute partial charge.
    - **MinAbsPartialCharge**: Minimum absolute partial charge.

- **Morgan fingerprint density** - quantify the frequency of occurrence of specific substructures within the molecule at a local level, taking into account their immediate surroundings. Higher values of density indicate a higher density of unique substructures in the molecule, while lower values indicate fewer unique substructures or a more uniform distribution of substructures.
    - **FpDensityMorgan1**: Fingerprint Density for Morgan Radius 1.
    - **FpDensityMorgan2**: Fingerprint Density for Morgan Radius 2.
    - **FpDensityMorgan3**: Fingerprint Density for Morgan Radius 3.
- **BCUT2D descriptors**: BCUT descriptors are the highest and lowest eigenvalues of a modified adjacency (Burden) matrix of the molecule, in which the diagonal elements are replaced by an atomic property. They summarize how that property is distributed over the molecular graph.
    - **BCUT2D_MWHI** and **BCUT2D_MWLOW**: highest and lowest eigenvalue when the diagonal holds the atomic masses.
    - **BCUT2D_CHGHI** and **BCUT2D_CHGLO**: highest and lowest eigenvalue when the diagonal holds the atomic partial charges.
    - **BCUT2D_LOGPHI** and **BCUT2D_LOGPLOW**: highest and lowest eigenvalue when the diagonal holds the atomic contributions to logP.
    - **BCUT2D_MRHI** and **BCUT2D_MRLOW**: highest and lowest eigenvalue when the diagonal holds the atomic contributions to molar refractivity.

- **AvgIpc**: Average information content of the coefficients of the characteristic polynomial of the adjacency matrix of the hydrogen-suppressed molecular graph. It is the size-normalized version of **Ipc** and measures the topological complexity of the molecule.

- **BalabanJ**: Balaban's J index. It quantifies the molecular topological structure by considering the connectivity of atoms and bonds in the molecule.

- **BertzCT**: Bertz complexity index. It measures the complexity or branching of a molecule based on its structural connectivity.

- **Chi0, Chi1, Chi2n, Chi3v, etc.**: Chi indices are topological descriptors that characterize the molecular shape and size. Each Chi index captures a different aspect of the connectivity and branching patterns in the molecule.

- **HallKierAlpha**: Hall-Kier alpha value. It corrects the Kappa shape indices for the size and hybridization of the atoms, based on their covalent radii relative to an sp3 carbon.

- **MolLogP**: Wildman-Crippen estimate of the logarithm of the octanol-water partition coefficient (logP). It quantifies the lipophilicity or hydrophobicity of a molecule, which is important for its distribution and permeability properties.

- **TPSA**: Topological polar surface area. It estimates the surface area of a molecule that is involved in polar interactions, which is relevant for its solubility and biological activity.

- **NumHAcceptors, NumHDonors, NumHeteroatoms, etc.**: These descriptors count the number of hydrogen bond acceptor groups, hydrogen bond donor groups, heteroatoms (atoms other than carbon and hydrogen), etc., present in the molecule. They provide information about the potential for specific molecular interactions.

- **RingCount**: Number of rings in the molecule. It indicates the level of molecular complexity and rigidity.

- **fr_Al_COO, fr_ArN, fr_COO, fr_Ph_OH, etc.**: These descriptors represent the count of specific functional groups or substructures in the molecule. They provide information about the presence of particular chemical moieties.
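Several of the topological descriptors in this list can be computed directly from the hydrogen-suppressed molecular graph. The sketch below uses the textbook graph-theoretical definitions of the zeroth- and first-order connectivity indices and counts rings as the number of independent cycles. The function names are assumptions, and RDKit's implementations may differ in details such as the handling of heteroatoms or disconnected structures.

```python
import math


def degrees(n_atoms, bonds):
    """Number of heavy-atom neighbors of each atom."""
    deg = [0] * n_atoms
    for a, b in bonds:
        deg[a] += 1
        deg[b] += 1
    return deg


def chi0(n_atoms, bonds):
    """Zeroth-order connectivity index: sum over atoms of 1/sqrt(degree)."""
    return sum(1 / math.sqrt(d) for d in degrees(n_atoms, bonds) if d > 0)


def chi1(n_atoms, bonds):
    """First-order connectivity index: sum over bonds of 1/sqrt(d_i * d_j)."""
    deg = degrees(n_atoms, bonds)
    return sum(1 / math.sqrt(deg[a] * deg[b]) for a, b in bonds)


def ring_count(n_atoms, bonds):
    """Independent cycles of a connected graph: bonds - atoms + 1."""
    return len(bonds) - n_atoms + 1


# Isobutane: a central carbon bonded to three methyl carbons.
isobutane = (4, [(0, 1), (0, 2), (0, 3)])
# Cyclohexane: six carbons in a ring.
cyclohexane = (6, [(i, (i + 1) % 6) for i in range(6)])

for name, (n, bonds) in {"isobutane": isobutane, "cyclohexane": cyclohexane}.items():
    print(name, round(chi0(n, bonds), 3), round(chi1(n, bonds), 3), ring_count(n, bonds))
```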

In [5]:
TwoDimensionDescriptors(n_jobs=10).featurize(dataset, inplace=True)

In [6]:
dataset.feature_names

array(['MaxAbsEStateIndex', 'MaxEStateIndex', 'MinAbsEStateIndex',
       'MinEStateIndex', 'qed', 'MolWt', 'HeavyAtomMolWt', 'ExactMolWt',
       'NumValenceElectrons', 'NumRadicalElectrons', 'MaxPartialCharge',
       'MinPartialCharge', 'MaxAbsPartialCharge', 'MinAbsPartialCharge',
       'FpDensityMorgan1', 'FpDensityMorgan2', 'FpDensityMorgan3',
       'BCUT2D_MWHI', 'BCUT2D_MWLOW', 'BCUT2D_CHGHI', 'BCUT2D_CHGLO',
       'BCUT2D_LOGPHI', 'BCUT2D_LOGPLOW', 'BCUT2D_MRHI', 'BCUT2D_MRLOW',
       'AvgIpc', 'BalabanJ', 'BertzCT', 'Chi0', 'Chi0n', 'Chi0v', 'Chi1',
       'Chi1n', 'Chi1v', 'Chi2n', 'Chi2v', 'Chi3n', 'Chi3v', 'Chi4n',
       'Chi4v', 'HallKierAlpha', 'Ipc', 'Kappa1', 'Kappa2', 'Kappa3',
       'LabuteASA', 'PEOE_VSA1', 'PEOE_VSA10', 'PEOE_VSA11', 'PEOE_VSA12',
       'PEOE_VSA13', 'PEOE_VSA14', 'PEOE_VSA2', 'PEOE_VSA3', 'PEOE_VSA4',
       'PEOE_VSA5', 'PEOE_VSA6', 'PEOE_VSA7', 'PEOE_VSA8', 'PEOE_VSA9',
       'SMR_VSA1', 'SMR_VSA10', 'SMR_VSA2', 'SMR_VSA3', 'SMR_VSA4',
 

In [8]:
dataset.X[0]

array([ 1.2915942e+01,  1.2915942e+01,  1.5547052e-02, -1.0479454e+00,
        6.2784576e-01,  3.2834299e+02,  3.1120700e+02,  3.2812231e+02,
        1.2400000e+02,  0.0000000e+00,  1.9552433e-01, -4.9668738e-01,
        4.9668738e-01,  1.9552433e-01,  1.1250000e+00,  1.8750000e+00,
        2.5833333e+00,  1.9142143e+01,  1.0232034e+01,  2.1378713e+00,
       -2.1238039e+00,  2.2726576e+00, -2.0836904e+00,  5.4712963e+00,
        2.0794968e-01,  2.9879730e+00,  1.8310844e+00,  8.3053522e+02,
        1.7104084e+01,  1.2978706e+01,  1.2978706e+01,  1.1562881e+01,
        7.3246827e+00,  7.3246827e+00,  5.2568851e+00,  5.2568851e+00,
        3.6155939e+00,  3.6155939e+00,  2.3750753e+00,  2.3750753e+00,
       -2.9900000e+00,  3.7483522e+05,  1.5888650e+01,  6.6362123e+00,
        3.3332713e+00,  1.3811201e+02,  2.0266706e+01,  1.1566732e+01,
        1.2107889e+01,  0.0000000e+00,  0.0000000e+00,  0.0000000e+00,
        4.5670996e+00,  4.3904152e+00,  0.0000000e+00,  0.0000000e+00,
      

# 3D descriptors

## Generating Conformers and exporting to a SDF file

3D structures can be generated with DeepMol and exported to a file.

We start by generating conformers with the **ETKDG** (Experimental-Torsion basic Knowledge Distance Geometry) algorithm. ETKDG is a widely used method for generating conformers of small organic molecules. It extends distance geometry (**DG**) embedding with torsion-angle preferences derived from experimental crystal structures and with basic chemical knowledge, such as the planarity of aromatic rings.

The **ETKDG** method thus combines stochastic sampling with knowledge-based rules to generate a diverse set of plausible conformers for small organic molecules. It strikes a balance between computational efficiency and conformational coverage, making it a popular choice for various molecular modeling and drug discovery applications.

After that, the conformers can be optimized with the **MMFF** (Merck Molecular Force Field) or **UFF** (Universal Force Field) force fields, which are commonly used for optimizing conformers of small organic molecules. Both force fields calculate the potential energy of a molecule based on its geometry and provide a set of atomic forces that guide the optimization towards more stable conformations.
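How forces guide a geometry towards lower energy can be illustrated with a single harmonic bond-stretching term, the simplest ingredient of a force field. This is a toy sketch only: the parameter values are arbitrary, the function name is hypothetical, and real MMFF or UFF optimizations involve many more energy terms and better optimizers.

```python
def minimize_bond_length(r_start, r_eq=1.5, k=100.0, step=0.001, n_steps=200):
    """Steepest descent on E(r) = 0.5 * k * (r - r_eq)**2."""
    r = r_start
    for _ in range(n_steps):
        force = -k * (r - r_eq)  # force is the negative gradient of the energy
        r += step * force        # move a small step along the force
    return r


def bond_energy(r, r_eq=1.5, k=100.0):
    return 0.5 * k * (r - r_eq) ** 2


stretched = 2.0
relaxed = minimize_bond_length(stretched)
print(bond_energy(stretched), bond_energy(relaxed))
```

Starting from a stretched bond, each step lowers the energy until the bond length settles at its equilibrium value.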

You can generate conformers and export them to an SDF file as follows:

In [None]:
dataset = CSVLoader("../data/CHEMBL217_reduced.csv", id_field="Original_Entry_ID",
                    smiles_field="SMILES", labels_fields=["Activity_Flag"]).create_dataset()
generate_conformers_to_sdf_file(dataset, "CHEMBL217_conformers.sdf", n_conformations=1, threads=15, max_iterations=3)

If you would rather load conformers that were already generated, read them from the SDF file:

In [None]:
dataset = SDFLoader("../data/CHEMBL217_conformers.sdf", id_field="_ID", labels_fields=["_Class"]).create_dataset()

## 3D descriptors

In [11]:
All3DDescriptors(mandatory_generation_of_conformers=False).featurize(dataset, inplace=True)

In [None]:
dataset.X

In [None]:
dataset.feature_names

## DeepChem Featurization

### Weave Featurization

Weave convolutions were introduced in [1]. Unlike Duvenaud graph convolutions, weave convolutions require a quadratic matrix of interaction descriptors for each pair of atoms. These extra descriptors may provide additional descriptive power, but at the cost of a larger featurized dataset. Weave convolutions are implemented in DeepChem as the WeaveFeaturizer class.

[1] Kearnes, Steven, et al. "Molecular graph convolutions: moving beyond fingerprints." Journal of computer-aided molecular design 30.8 (2016): 595-608.

In [None]:
WeaveFeat(n_jobs=10).featurize(dataset)

### Coulomb Featurization

Coulomb matrices provide a representation of the electronic structure of a molecule. For a molecule with N atoms, the Coulomb matrix is an N × N matrix in which each off-diagonal element gives the strength of the electrostatic interaction between two atoms. The method is described in more detail in [1].

[1] Montavon, Grégoire, et al. “Learning invariant representations of molecules for atomization energy prediction.” Advances in neural information processing systems. 2012.
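As a minimal sketch of the definition, the Coulomb matrix of a small molecule can be built from atomic numbers and 3D coordinates. The off-diagonal elements follow the description above. The diagonal term `0.5 * Z**2.4` is the convention commonly used in the literature on Coulomb matrices and is stated here as an assumption. The function name and the H2 geometry are illustrative.

```python
import math


def coulomb_matrix(atomic_numbers, coordinates):
    """Z_i * Z_j / |R_i - R_j| off the diagonal, 0.5 * Z_i**2.4 on the diagonal."""
    n = len(atomic_numbers)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                matrix[i][j] = 0.5 * atomic_numbers[i] ** 2.4
            else:
                distance = math.dist(coordinates[i], coordinates[j])
                matrix[i][j] = atomic_numbers[i] * atomic_numbers[j] / distance
    return matrix


# Hydrogen molecule with an illustrative bond length of 1.4 (atomic units).
h2 = coulomb_matrix([1, 1], [(0.0, 0.0, 0.0), (0.0, 0.0, 1.4)])
for row in h2:
    print(row)
```

Because the matrix depends on interatomic distances, a 3D structure is required before this featurizer can be applied.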

In [None]:
CoulombFeat(n_jobs=10, max_atoms=10).featurize(dataset)

### Coulomb Eig Featurization

This featurizer computes the eigenvalues of the Coulomb matrices for the provided molecules. Coulomb matrices are described in [1]. Because the eigenvalues do not depend on the order in which the atoms are listed, they give a representation of the molecule that is invariant to atom ordering.

[1] Montavon, Grégoire, et al. “Learning invariant representations of molecules for atomization energy prediction.” Advances in neural information processing systems. 2012.
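The sketch below computes the sorted eigenvalues of a small symmetric matrix with the classical Jacobi rotation method and checks that reordering the atoms leaves the spectrum unchanged. It is a dependency-free illustration with made-up matrix values, not the routine used by DeepChem or DeepMol, and the function names are assumptions.

```python
import math


def jacobi_eigenvalues(matrix, sweeps=50):
    """Eigenvalues of a symmetric matrix via cyclic Jacobi rotations."""
    n = len(matrix)
    a = [row[:] for row in matrix]
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < 1e-20:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-15:
                    continue
                # Rotation angle that zeroes the (p, q) element.
                theta = 0.5 * math.atan2(2 * a[p][q], a[q][q] - a[p][p])
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # update columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # update rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return sorted((a[i][i] for i in range(n)), reverse=True)


def permute(matrix, order):
    """Reorder the atoms (rows and columns) of a matrix."""
    return [[matrix[i][j] for j in order] for i in order]


# Illustrative symmetric matrix standing in for a Coulomb matrix.
example = [[36.9, 5.0, 3.0], [5.0, 0.5, 0.4], [3.0, 0.4, 0.5]]
spectrum = jacobi_eigenvalues(example)
spectrum_permuted = jacobi_eigenvalues(permute(example, [2, 0, 1]))
print(spectrum)
print(spectrum_permuted)
```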

In [None]:
CoulombEigFeat(n_jobs=10, max_atoms=10).featurize(dataset)

### ConvMolFeat

Duvenaud graph convolutions [1] construct a vector of descriptors for each atom in a molecule. The featurizer computes that vector of local descriptors.

[1] Duvenaud, David K., et al. “Convolutional networks on graphs for learning molecular fingerprints.” Advances in neural information processing systems. 2015.
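A per-atom descriptor vector can be sketched as a one-hot encoding of a few local properties. The feature set below (element and degree only) is a reduced, hypothetical stand-in for illustration, and the real featurizer uses a richer set of atom properties.

```python
ELEMENTS = ["C", "N", "O", "other"]
MAX_DEGREE = 4


def one_hot(value, choices):
    return [1.0 if value == choice else 0.0 for choice in choices]


def atom_feature_matrix(atoms, bonds):
    """Return one row of local descriptors per atom (element + degree)."""
    degree = [0] * len(atoms)
    for a, b in bonds:
        degree[a] += 1
        degree[b] += 1
    rows = []
    for symbol, deg in zip(atoms, degree):
        element = symbol if symbol in ELEMENTS else "other"
        rows.append(
            one_hot(element, ELEMENTS)
            + one_hot(min(deg, MAX_DEGREE), range(MAX_DEGREE + 1))
        )
    return rows


# Ethanol heavy atoms: C-C-O
features = atom_feature_matrix(["C", "C", "O"], [(0, 1), (1, 2)])
for row in features:
    print(row)
```

The output has one row per atom, so its first dimension varies from molecule to molecule, unlike the fixed-length fingerprints shown earlier.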

In [None]:
ConvMolFeat(n_jobs=10).featurize(dataset)

### MolGraphConvFeat

This class is a featurizer of general graph convolution networks for molecules.

The default node (atom) and edge (bond) representations are based on the WeaveNet paper. If you want to use your own representations

In [None]:
MolGraphConvFeat(n_jobs=10).featurize(dataset)

### SmileImageFeat

SmilesToImage Featurizer takes a SMILES string and turns it into an image. Details taken from [1].

The default size of the image is 80 x 80. Two image modes are currently supported: `std` and `engd`. `std` is the grayscale specification, with atomic numbers as pixel values for atom positions and a constant value of 2 for bond positions. `engd` is a 4-channel specification, which uses atom properties such as hybridization, valency and charges in addition to the atomic number. The bond type is also used for the bonds.

The coordinates of all atoms are computed, and lines are drawn between atoms to indicate bonds. For the respective channels, the atom and bond positions are set to the property values as mentioned in the paper.

[1] Goh, Garrett B., et al. “Using rule-based labels for weak supervised learning: a ChemNet for transferable chemical property prediction.” Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2018.
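The `std` mode can be sketched on a tiny grid: bond positions get the constant value 2 and atom positions get the atomic number. In this toy version the 2D grid coordinates are supplied by hand rather than computed from the SMILES string, and the function name and grid size are assumptions.

```python
def draw_molecule(atomic_numbers, coords, bonds, size=9):
    """Render atoms and bonds onto a size x size grid of integers.

    coords holds (row, col) grid positions; bonds holds pairs of atom indices.
    """
    image = [[0] * size for _ in range(size)]
    # Draw bonds first by sampling points along each segment.
    for a, b in bonds:
        (r0, c0), (r1, c1) = coords[a], coords[b]
        steps = max(abs(r1 - r0), abs(c1 - c0), 1)
        for t in range(steps + 1):
            row = round(r0 + (r1 - r0) * t / steps)
            col = round(c0 + (c1 - c0) * t / steps)
            image[row][col] = 2
    # Atoms overwrite bond pixels with their atomic number.
    for z, (row, col) in zip(atomic_numbers, coords):
        image[row][col] = z
    return image


# Ethanol heavy atoms (C, C, O) laid out on one row of the grid.
image = draw_molecule([6, 6, 8], [(4, 1), (4, 4), (4, 7)], [(0, 1), (1, 2)])
for row in image:
    print(row)
```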

In [None]:
SmileImageFeat(n_jobs=10).featurize(dataset)

### SmilesSeqFeat

SmilesToSeq Featurizer takes a SMILES string and turns it into a sequence. Details taken from [1].

SMILES strings smaller than a specified max length (max_len) are padded using the PAD token while those larger than the max length are not considered. Based on the paper, there is also the option to add extra padding (pad_len) on both sides of the string after length normalization. Using a character to index (char_to_idx) mapping, the SMILES characters are turned into indices and the resulting sequence of indices serves as the input for an embedding layer.

[1] Goh, Garrett B., et al. “Using rule-based labels for weak supervised learning: a ChemNet for transferable chemical property prediction.” Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2018.
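The padding and indexing steps described above can be sketched as follows. This is a character-level simplification (two-letter elements such as Cl are split into two characters), and the function names, the `<pad>` token string and the example values of `max_len` and `pad_len` are assumptions.

```python
PAD = "<pad>"


def build_char_to_idx(smiles_list):
    """Map the padding token to 0 and every character seen to a positive index."""
    chars = sorted({ch for smiles in smiles_list for ch in smiles})
    char_to_idx = {PAD: 0}
    for i, ch in enumerate(chars, start=1):
        char_to_idx[ch] = i
    return char_to_idx


def smiles_to_sequence(smiles, char_to_idx, max_len=8, pad_len=1):
    """Return a list of indices, or None if the SMILES exceeds max_len."""
    if len(smiles) > max_len:
        return None  # strings longer than max_len are not considered
    tokens = list(smiles) + [PAD] * (max_len - len(smiles))  # length normalization
    tokens = [PAD] * pad_len + tokens + [PAD] * pad_len      # extra padding on both sides
    return [char_to_idx[token] for token in tokens]


smiles_list = ["CCO", "c1ccccc1", "CC(=O)O"]
char_to_idx = build_char_to_idx(smiles_list)
sequences = [smiles_to_sequence(smiles, char_to_idx) for smiles in smiles_list]
print(char_to_idx)
print(sequences)
```

Every returned sequence has length `max_len + 2 * pad_len`, so the sequences can be stacked into a single array and passed to an embedding layer.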

In [None]:
SmilesSeqFeat().featurize(dataset)

### MolGanFeat

Featurizer for MolGAN de-novo molecular generation [1]. The default representation is a GraphMatrix object, which is a wrapper for two matrices containing atom and bond type information. The class also provides reverse capabilities, that is, converting the matrices back into molecules.

[1] MolGAN: An implicit generative model for small molecular graphs. https://arxiv.org/abs/1805.11973
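The two-matrix representation and its reverse conversion can be sketched as follows. The integer encodings, the padding size and the function names are assumptions made for illustration and do not reproduce the actual GraphMatrix class.

```python
ATOM_TYPES = {"C": 1, "N": 2, "O": 3}                 # 0 is reserved for "no atom"
BOND_TYPES = {"single": 1, "double": 2, "triple": 3}  # 0 means "no bond"


def to_graph_matrices(atoms, bonds, max_atoms=5):
    """Encode a molecule as a padded node vector and a bond-type adjacency matrix."""
    if len(atoms) > max_atoms:
        raise ValueError("molecule has more atoms than max_atoms")
    node_features = [ATOM_TYPES[a] for a in atoms] + [0] * (max_atoms - len(atoms))
    adjacency = [[0] * max_atoms for _ in range(max_atoms)]
    for a, b, order in bonds:
        adjacency[a][b] = adjacency[b][a] = BOND_TYPES[order]
    return node_features, adjacency


def from_graph_matrices(node_features, adjacency):
    """Reverse conversion: recover the atom list and bond list."""
    atom_names = {v: k for k, v in ATOM_TYPES.items()}
    bond_names = {v: k for k, v in BOND_TYPES.items()}
    atoms = [atom_names[code] for code in node_features if code != 0]
    bonds = [
        (i, j, bond_names[adjacency[i][j]])
        for i in range(len(atoms))
        for j in range(i + 1, len(atoms))
        if adjacency[i][j] != 0
    ]
    return atoms, bonds


# Acetic acid heavy atoms: C-C(=O)-O
atoms = ["C", "C", "O", "O"]
bonds = [(0, 1, "single"), (1, 2, "double"), (1, 3, "single")]
node_features, adjacency = to_graph_matrices(atoms, bonds)
print(node_features)
for row in adjacency:
    print(row)
print(from_graph_matrices(node_features, adjacency))
```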

In [None]:
MolGanFeat(n_jobs=10).featurize(dataset)