# Converting Molecules to SMIRKS Patterns

The overall goal of `chemper` is to create hierarchical SMIRKS patterns that maintain the clustering of input molecular subgraphs.
`SMIRKS` is a language for substructure searches that uses decorators on atoms and bonds to specify the desired chemistry (more details on this language are available in the [Daylight Theory Manual](http://www.daylight.com/dayhtml/doc/theory/theory.smarts.html)).
The first step in converting molecular subgraphs into SMIRKS patterns is to extract all possible decorators for the atoms and bonds in a molecule object.
To begin, we therefore created `ChemPerGraph`, which stores all possible decorators for atoms and bonds in a graph structure that can be written out as a SMIRKS pattern.

These objects were created largely as a precursor to `ClusterGraph`, which allows multiple molecules to be converted into a single graph object.
However, some people may find them useful for their standalone functionality.
To the best of our knowledge, no previous tool could create a SMIRKS pattern from a molecule that describes some or all of the atoms in its structure.
`RDKit` can write molecules as "SMARTS", but as far as we can tell it just writes a SMILES string with square brackets around the atoms, without any of the additional decorators used in `SMARTS` or `SMIRKS`, and it does not support atom indexing.

A `ChemPerGraph` can be built from a molecule and a dictionary of key atoms using the `ChemPerGraphFromMol` class.
These classes depend on `chemper` molecules, which are simply wrappers around molecules from common cheminformatics toolkits.
Currently we support the `RDKit` and `OpenEye` toolkits.

In [2]:
# Import tools from chemper
from chemper.mol_toolkits import mol_toolkit
from chemper.graphs.fragment_graph import ChemPerGraphFromMol

## Make the Cyclopentane Molecule

In this notebook we will look at different options for `ChemPerGraphFromMol`, always building from a cyclopentane molecule.
So that this notebook does not rely on a specific toolkit, we use the `mol_toolkit.MolFromSmiles` function, which generates a `chemper` `Mol` using whichever toolkit is currently installed.
However, if you already have an `rdchem.Mol` or `oechem.OEMol`, the same functionality is available by calling `mol_toolkit.Mol(m)`, where `m` is a molecule from the toolkit you have installed.

In [5]:
mol = mol_toolkit.MolFromSmiles('C1CCCC1')
print(mol.get_smiles())

C1CCCC1


## Only the Indexed Atoms

In the simplest case, a `SMIRKS` pattern is created for only the indexed atoms.
A `SMIRKS` index tags a specific atom with a colon followed by a positive integer.
For example, in `'[#6AX4:1]-!@[#1:2]'`, atom 1 (`:1`) is a carbon atom (`#6`) that is aliphatic (`A`) and has four connections (`X4`), and atom 2 (`:2`) is a hydrogen atom (`#1`).
The two atoms are connected by a single bond (`-`) that is not in a ring (`!@`).
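The decorators in a bracketed atom can be read off mechanically. The sketch below is a minimal, standalone decoder for the subset of decorators that appear in this notebook (`#n`, `A`/`a`, `Hn`, `Xn`, `xn`, `rn`, a signed charge, and the `:n` index). The function `describe_atom` and its output keys are hypothetical names chosen for illustration; they are not part of `chemper`, and the decoder does not handle the full SMARTS grammar.

```python
import re

# One bracketed atom: a decorator body plus an optional ":n" SMIRKS index
ATOM_RE = re.compile(r"\[([^\]:]+)(?::(\d+))?\]")
# Only the decorators used in this notebook (an assumption, not full SMARTS)
TOKEN_RE = re.compile(r"#\d+|[Aa]|H\d+|X\d+|x\d+|r\d+|[+-]\d+")


def describe_atom(atom):
    """Translate one bracketed SMIRKS atom into a dictionary of properties."""
    match = ATOM_RE.fullmatch(atom)
    if match is None:
        raise ValueError("not a single bracketed atom: %r" % atom)
    body, index = match.groups()
    info = {"smirks_index": int(index) if index else None}
    for token in TOKEN_RE.findall(body):
        if token[0] == "#":
            info["atomic_number"] = int(token[1:])
        elif token in ("A", "a"):
            info["aromatic"] = token == "a"      # "A" means aliphatic
        elif token[0] == "H":
            info["hydrogens"] = int(token[1:])   # total attached hydrogens
        elif token[0] == "X":
            info["connectivity"] = int(token[1:])  # total connections
        elif token[0] == "x":
            info["ring_bonds"] = int(token[1:])  # number of ring bonds
        elif token[0] == "r":
            info["ring_size"] = int(token[1:])   # smallest ring size
        else:
            info["charge"] = int(token)          # e.g. "+0" or "-1"
    return info


print(describe_atom("[#6AX4:1]"))
print(describe_atom("[#1:2]"))
```

The same function also accepts the longer atoms that `chemper` writes, such as `[#6AH2X4x2r5+0:1]`.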

`ChemPerGraphFromMol` objects are initialized with a `chemper` molecule and a dictionary storing atom indices by desired `SMIRKS` index.
Each key in the dictionary is a `SMIRKS` index, and its value is the index of the atom in the molecule that is assigned that `SMIRKS` index.
For example, if you provided the dictionary:
```
smirks_dict = {1:0, 2:3}
```
the atom with index `0` in your molecule would be used to make the atom in the `SMIRKS` pattern with `:1`.
Similarly, atom `3` in your molecule would be used to make the `SMIRKS` atom with `:2`.
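Because the direction of this mapping is easy to get backwards, it can help to check a dictionary before using it. The sketch below uses a hypothetical helper, `invert_smirks_dict`, which is not part of `chemper`. It assumes that `SMIRKS` indices are positive integers, that atom indices are zero-based, and that each atom should receive at most one `SMIRKS` index; the last point is our assumption rather than a documented requirement.

```python
def invert_smirks_dict(smirks_dict, n_atoms):
    """Check a {SMIRKS index: atom index} dictionary and return the
    reverse {atom index: SMIRKS index} mapping.

    Hypothetical helper for illustration only.
    """
    atom_to_smirks = {}
    for smirks_index, atom_index in smirks_dict.items():
        if not isinstance(smirks_index, int) or smirks_index < 1:
            raise ValueError("SMIRKS index must be a positive integer: %r" % (smirks_index,))
        if not 0 <= atom_index < n_atoms:
            raise ValueError("atom index %r is outside the molecule" % (atom_index,))
        if atom_index in atom_to_smirks:
            # Assumption: one SMIRKS index per atom
            raise ValueError("atom %r was assigned two SMIRKS indices" % (atom_index,))
        atom_to_smirks[atom_index] = smirks_index
    return atom_to_smirks


# Cyclopentane has five heavy atoms, numbered 0 through 4
atom_to_smirks = invert_smirks_dict({1: 0, 2: 3}, n_atoms=5)
print(atom_to_smirks)  # {0: 1, 3: 2}
```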

In [6]:
# assign atom 0 to SMIRKS index 1 and atom 4 to SMIRKS index 2
smirks_dict = {1:0, 2:4}
graph = ChemPerGraphFromMol(mol, smirks_dict)
graph.as_smirks()

'[#6AH2X4x2r5+0:1]-@[#6AH2X4x2r5+0:2]'

## Include Non-Indexed Atoms

The `ChemPerGraphFromMol` class also has an input option, `layers`, which specifies how many bonds away from the indexed atoms an atom can be and still be included in the `SMIRKS`.
The default `layers` value is `0`, meaning only the indexed atoms are included in the `SMIRKS` pattern.
If `layers` is greater than 0, then atoms up to that many bonds away from the indexed atoms are added to the graph.

Here is an example with `layers=1` and then `layers=2`.

In [7]:
graph = ChemPerGraphFromMol(mol, smirks_dict, layers=1)
graph.as_smirks()

'[#6AH2X4x2r5+0:1](-!@[#1AH0X1x0r0+0])(-@[#6AH2X4x2r5+0])(-!@[#1AH0X1x0r0+0])-@[#6AH2X4x2r5+0:2](-!@[#1AH0X1x0r0+0])(-@[#6AH2X4x2r5+0])-!@[#1AH0X1x0r0+0]'

In [8]:
graph = ChemPerGraphFromMol(mol, smirks_dict, layers=2)
graph.as_smirks()

'[#6AH2X4x2r5+0:1](-!@[#1AH0X1x0r0+0])(-!@[#1AH0X1x0r0+0])(-@[#6AH2X4x2r5+0](-@[#6AH2X4x2r5+0])(-!@[#1AH0X1x0r0+0])-!@[#1AH0X1x0r0+0])-@[#6AH2X4x2r5+0:2](-!@[#1AH0X1x0r0+0])(-@[#6AH2X4x2r5+0](-!@[#1AH0X1x0r0+0])-!@[#1AH0X1x0r0+0])-!@[#1AH0X1x0r0+0]'
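One way to picture the `layers` option is as a breadth-first walk outward from the indexed atoms that stops after a fixed number of bonds. The sketch below is a toolkit-free illustration of that idea, not the `chemper` implementation. The hand-built adjacency list and its hydrogen numbering (carbons 0 to 4, hydrogens 5 to 14) are assumptions made for this example, and `atoms_within_layers` is a hypothetical name.

```python
from collections import deque


def build_cyclopentane():
    """Adjacency list for cyclopentane: ring carbons 0-4, hydrogens 5-14.

    The hydrogen numbering is an assumption made for this sketch.
    """
    neighbors = {i: set() for i in range(15)}

    def bond(a, b):
        neighbors[a].add(b)
        neighbors[b].add(a)

    for c in range(5):
        bond(c, (c + 1) % 5)   # ring bond to the next carbon
        bond(c, 5 + 2 * c)     # first hydrogen on this carbon
        bond(c, 6 + 2 * c)     # second hydrogen on this carbon
    return neighbors


def atoms_within_layers(neighbors, indexed_atoms, layers):
    """Return atoms at most `layers` bonds from any indexed atom."""
    limit = float("inf") if layers == "all" else layers
    distance = {atom: 0 for atom in indexed_atoms}
    queue = deque(indexed_atoms)
    while queue:
        atom = queue.popleft()
        if distance[atom] >= limit:
            continue  # do not expand past the requested number of layers
        for nbr in neighbors[atom]:
            if nbr not in distance:
                distance[nbr] = distance[atom] + 1
                queue.append(nbr)
    return set(distance)


neighbors = build_cyclopentane()
for layers in (0, 1, 2, "all"):
    selected = atoms_within_layers(neighbors, [0, 4], layers)
    print(layers, len(selected))
```

For indexed atoms 0 and 4 this selects 2, 8, 13, and 15 atoms respectively, which agrees with the number of bracketed atoms in the `layers=1` and `layers=2` patterns above.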

## Encode the Whole Molecule

The other option for `layers` is `'all'`, which will continue adding atoms until there are no more atoms in the molecule.
These `SMIRKS` are hard for humans to read, but they do encode all information about the molecule.

In [9]:
graph = ChemPerGraphFromMol(mol, smirks_dict, layers='all')
graph.as_smirks()

'[#6AH2X4x2r5+0:1](-!@[#1AH0X1x0r0+0])(-@[#6AH2X4x2r5+0](-!@[#1AH0X1x0r0+0])(-!@[#1AH0X1x0r0+0])-@[#6AH2X4x2r5+0](-!@[#1AH0X1x0r0+0])(-!@[#1AH0X1x0r0+0])-@[#6AH2X4x2r5+0](-!@[#1AH0X1x0r0+0])-!@[#1AH0X1x0r0+0])(-!@[#1AH0X1x0r0+0])-@[#6AH2X4x2r5+0:2](-!@[#1AH0X1x0r0+0])-!@[#1AH0X1x0r0+0]'
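Since a pattern this long is hard to check by eye, a quick sanity check is to count the bracketed atoms by element. The sketch below copies the output string above and uses only the standard library; it relies on the observation that every atom in these patterns begins with `[#` followed by its atomic number, which holds for the output shown here but is not guaranteed for arbitrary `SMIRKS`.

```python
import re
from collections import Counter

# Output of graph.as_smirks() with layers='all', split only for readability
full_smirks = (
    "[#6AH2X4x2r5+0:1](-!@[#1AH0X1x0r0+0])"
    "(-@[#6AH2X4x2r5+0](-!@[#1AH0X1x0r0+0])(-!@[#1AH0X1x0r0+0])"
    "-@[#6AH2X4x2r5+0](-!@[#1AH0X1x0r0+0])(-!@[#1AH0X1x0r0+0])"
    "-@[#6AH2X4x2r5+0](-!@[#1AH0X1x0r0+0])-!@[#1AH0X1x0r0+0])"
    "(-!@[#1AH0X1x0r0+0])"
    "-@[#6AH2X4x2r5+0:2](-!@[#1AH0X1x0r0+0])-!@[#1AH0X1x0r0+0]"
)

# Atomic number of every bracketed atom
element_counts = Counter(int(n) for n in re.findall(r"\[#(\d+)", full_smirks))
# SMIRKS indices that tag atoms in the pattern
smirks_indices = [int(n) for n in re.findall(r":(\d+)\]", full_smirks)]

print("carbons:", element_counts[6])
print("hydrogens:", element_counts[1])
print("indexed atoms:", smirks_indices)
```

The counts come out to 5 carbons and 10 hydrogens, the molecular formula of cyclopentane, with only atoms `:1` and `:2` indexed.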