# Generating reasonable SMIRKS patterns

In this notebook we demonstrate how `chemper`'s `SMIRKSifier` generates SMIRKS patterns for a list of molecules and an assigned clustering. In other words, the goal is to generate a list of SMIRKS patterns that maintains the clustering of molecular fragments the user specifies. For example, imagine you have determined the force field parameters for all bonds in your molecule set. You could group together the bonds that should be assigned the same force constant and equilibrium bond length. The goal of `chemper`'s tools is to generate a hierarchical list of SMIRKS that maintains your clustering.

`chemper`'s `ClusterGraph` can create a single SMIRKS pattern for a group of molecular fragments. These SMIRKS patterns are fully specified using all possible SMIRKS decorators for each atom. The `SMIRKSifier` takes advantage of the `ClusterGraph` functionality and then removes unnecessary SMIRKS decorators so the final list of patterns is as generic as possible. 

In this example, we assume that we want to group bonds based on their bond order so all single bonds should be in one group, all double in another group, and so on. The steps shown below are as follows:

1. Create `chemper` molecules for a list of SMILES 
2. Classify the bonds in each molecule as single, double, aromatic, triple. Then group those bonds (based on atom indices) into each of those categories. 
3. Use `chemper.smirksify.SMIRKSifier` to automatically create a hierarchical SMIRKS pattern list, then run it to remove unnecessary decorators.

In [1]:
from chemper.mol_toolkits import mol_toolkit
from chemper.chemper_utils import create_tuples_for_clusters
from chemper.smirksify import SMIRKSifier, print_smirks

## 1. Create a list of Molecules

Here we choose a list of SMILES strings and then use `chemper.mol_toolkits` to create a list of molecule objects.

In [2]:
smiles_list = ["CCCCC", "c1ccccc1", "C1=CNC=C1", "CC=CC", "C(=O)OC", "C1C=CCCC1"]
molecules = [mol_toolkit.MolFromSmiles(s) for s in smiles_list]

## 2. Classify bonds

In this section we classify bonds based on the categories

* single
* aromatic
* double
* triple

This is done with the utility function `create_tuples_for_clusters`, which creates a list of atom indices (as tuples) for each molecule, in the form
` [ ('label', [ [(tuple of atoms for each molecule), ...] ...]) ] `.

A SMIRKS pattern for a bond has two indexed atoms (1 and 2) because you need two atoms to identify a bond. For example, in the first molecule above, there is a single bond between atoms 0 and 1. The `cluster_list` would specify that bond with a tuple `(0,1)`. There is a list of tuples for each molecule associated with each label.

In this example, there are six molecules. As an illustration of how the `cluster_list` is structured, consider the aromatic bonds, at `cluster_list[1]`.
Only molecules 1 and 2 (counting from zero) have aromatic bonds. The bonds in these molecules are specified by tuples showing the atom indices for each of the aromatic bonds below.
The other four molecules have zero aromatic bonds, so the associated lists are empty as no bonds need to be specified.


### Moving forward

Obviously, in the long run we don't want to start with SMIRKS patterns. However, you could imagine identifying the equilibrium bond length and force constant for a variety of bonds. You could then cluster those bonds based on which parameters they should be assigned, and give `chemper` these clusters of bonds instead.
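As a rough illustration of the parameter-based clustering described above, the following sketch groups bonds that share identical parameters into the same `('label', [per-molecule tuples])` layout. Everything here is hypothetical: the `bond_params` input, the numbers in it, and the `clusters_from_parameters` helper are made up for illustration and are not part of `chemper`.

```python
from collections import defaultdict

# Hypothetical input: for each molecule, a map from bonded atom-index pairs
# to the (force constant, equilibrium length) that bond should receive.
# The numbers are invented for illustration only.
bond_params = [
    {(0, 1): (620.0, 1.526), (1, 2): (620.0, 1.526)},    # molecule 0
    {(0, 1): (938.0, 1.400), (1, 2): (938.0, 1.400)},    # molecule 1
    {(0, 1): (620.0, 1.526), (1, 2): (1098.0, 1.350)},   # molecule 2
]

def clusters_from_parameters(bond_params):
    """Group bonds with identical parameters; one list of tuples per molecule."""
    grouped = defaultdict(lambda: [[] for _ in bond_params])
    for mol_idx, bonds in enumerate(bond_params):
        for atom_pair, params in bonds.items():
            grouped[params][mol_idx].append(atom_pair)
    # Label clusters in the order their parameters were first seen.
    return [("cluster_%d" % i, per_mol)
            for i, per_mol in enumerate(grouped.values())]

param_clusters = clusters_from_parameters(bond_params)
for label, per_mol in param_clusters:
    print(label, per_mol)
```

Real parameters would rarely be exactly equal, so a practical version would need some tolerance or a proper clustering step before grouping.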

In [3]:
smirks_labels = [('sing', '[*:1]-[*:2]'),
                 ('aromatic', '[*:1]:[*:2]'),
                 ('double', '[*:1]=[*:2]'),
                 ('triple', '[*:1]#[*:2]'),
                 ]
cluster_list = create_tuples_for_clusters(smirks_labels, molecules)

In [4]:
cluster_list[1]

('aromatic',
 [[],
  [(0, 1), (1, 2), (4, 5), (0, 5), (2, 3), (3, 4)],
  [(1, 2), (0, 1), (0, 4), (2, 3), (3, 4)],
  [],
  [],
  []])
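To make the nesting concrete, here is a small sketch that unpacks the aromatic entry shown above and counts the bonds per molecule. It copies the printed output into a literal rather than calling `chemper`, so it runs on its own.

```python
# The aromatic entry copied from the output above: (label, [bonds per molecule]).
aromatic = ('aromatic',
            [[],
             [(0, 1), (1, 2), (4, 5), (0, 5), (2, 3), (3, 4)],
             [(1, 2), (0, 1), (0, 4), (2, 3), (3, 4)],
             [],
             [],
             []])

label, bonds_per_molecule = aromatic

# One inner list per molecule, in the same order as smiles_list.
bond_counts = [len(bonds) for bonds in bonds_per_molecule]
molecules_with_bonds = [i for i, n in enumerate(bond_counts) if n > 0]

print(label, bond_counts, molecules_with_bonds)
```

Benzene (molecule 1) contributes six aromatic bonds and pyrrole (molecule 2) contributes five, matching the two non-empty lists.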

## 3. Generate SMIRKS and remove unnecessary SMIRKS decorators

The goal in this step is to create a generic, hierarchical list of SMIRKS patterns which will maintain the clustering of bonds we specified above. 

First we will create a `SMIRKSifier` object. This takes your molecules and the list of classified bonds and automatically creates SMIRKS patterns using ALL possible decorators. As you can see, this process leads to highly specific patterns, which would not be practical if you want your clustering to apply to molecules outside your training set.

There are also two optional arguments for the `SMIRKSifier`:

* `layers`: this is how many atoms away from the indexed atoms you want included in your automatically generated SMIRKS patterns.
    - **Note** `layers` will probably not stay a user input since we could determine automatically if adding a layer to each SMIRKS is necessary. 
* `verbose`: if `True` (the default) the `SMIRKSifier` prints out information while matching the automatically generated SMIRKS with the initially assigned clusters. 

### 3a. Create the `SMIRKSifier` printing out initial SMIRKS patterns

In [5]:
bond_smirksifier = SMIRKSifier(molecules, cluster_list, layers=0)


 Label                | SMIRKS 
 zz_sing              | [#6!rAH1X3x0,#6!rAH2X4x0,#6!rAH3X4x0,#6AH1X3r6x2,#6AH2X4r6x2,#6H1X3ar5x2,#6H1X3ar6x2,#7H1X3ar5x2,#8!rAH0X2x0;+0:1]-[#1!rH0X1x0,#6!rH1X3x0,#6!rH2X4x0,#6!rH3X4x0,#6H1X3r6x2,#6H2X4r6x2,#8!rH0X2x0;+0;A:2] 
--------------------------------------------------------------------------------
 zz_aromatic          | [#6r5,#6r6,#7r5;+0;H1;X3;a;x2:1]:;@[#6r5,#6r6,#7r5;+0;H1;X3;a;x2:2] 
--------------------------------------------------------------------------------
 zz_double            | [*!rx0,*r6x2;#6;+0;A;H1;X3:1]=[#6!rH1X3x0,#6H1X3r6x2,#8!rH0X1x0;+0;A:2] 
--------------------------------------------------------------------------------



### 3b. Start removing decorators

The `SMIRKSifier.reduce` function attempts to remove a single decorator from a randomly chosen SMIRKS pattern during each iteration. It has two arguments:

* `max_its` (optional, default=1000): the number of iterations in which a removal is attempted
    - currently it always runs for this many iterations; we are working on determining whether there is a way to tell that we are done before the number of iterations is reached
* `verbose` (optional, default=do not change the setting): this temporarily changes the `SMIRKSifier`'s verbosity, so you can make a long run quieter.

This function returns the current set of SMIRKS patterns at the end of the run. You can use the `print_smirks` function imported from `chemper.smirksify` to print these in a semi-nicely formatted way.
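The accept/reject idea behind this kind of reduction can be sketched with a toy model. This is not `chemper`'s implementation: fragments are reduced to plain sets of feature strings, a pattern matches when all of its decorators appear in a fragment's features, and the last matching pattern in the list is assumed to win. The names `assign`, `typing_matches`, and `reduce_patterns` are hypothetical.

```python
import random

def assign(patterns, features):
    """Return the label of the LAST pattern whose decorators all appear in features."""
    label = None
    for name, decorators in patterns:
        if decorators <= features:
            label = name
    return label

def typing_matches(patterns, items):
    """True if every fragment still receives the label it was assigned."""
    return all(assign(patterns, features) == label for features, label in items)

def reduce_patterns(patterns, items, max_its=100, seed=0):
    """Try removing one random decorator per iteration; keep only safe removals."""
    rng = random.Random(seed)
    current = [(name, set(decs)) for name, decs in patterns]
    for _ in range(max_its):
        idx = rng.randrange(len(current))
        name, decs = current[idx]
        if not decs:
            continue  # nothing left to remove from this pattern
        removed = rng.choice(sorted(decs))
        proposal = list(current)
        proposal[idx] = (name, decs - {removed})
        if typing_matches(proposal, items):
            current = proposal  # accepted: clustering is unchanged
        # otherwise rejected: the proposal changed how fragments are clustered
    return current

# Toy fragments (feature set, assigned label) and fully decorated starting patterns.
items = [
    (frozenset({"single", "acyclic", "neutral"}), "sing"),
    (frozenset({"single", "ring", "neutral"}), "sing"),
    (frozenset({"aromatic", "ring", "neutral"}), "aromatic"),
    (frozenset({"double", "acyclic", "neutral"}), "double"),
    (frozenset({"double", "ring", "neutral"}), "double"),
]
start = [
    ("sing", {"single", "neutral"}),
    ("aromatic", {"aromatic", "ring", "neutral"}),
    ("double", {"double", "neutral"}),
]

reduced = reduce_patterns(start, items, max_its=200)
for name, decs in reduced:
    print(name, sorted(decs))
```

Because later patterns override earlier ones in this toy, the first pattern is free to lose every decorator, which mirrors how a generic first pattern can coexist with more specific ones further down a hierarchical list.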


In [6]:
smirks10 = bond_smirksifier.reduce(max_its=10)

Iteration:  0
Attempting to change SMIRKS #0
[#6!rAH1X3x0,#6!rAH2X4x0,#6!rAH3X4x0,#6AH1X3r6x2,#6AH2X4r6x2,#6H1X3ar5x2,#6H1X3ar6x2,#7H1X3ar5x2,#8!rAH0X2x0;+0:1]-[#1!rH0X1x0,#6!rH1X3x0,#6!rH2X4x0,#6!rH3X4x0,#6H1X3r6x2,#6H2X4r6x2,#8!rH0X2x0;+0;A:2]  -->  [#6H1Ax0X3,#6X4Ax0H2,#6X4Ax0H3,#6H1Ar6x2X3,#6X4Ar6x2H2,#6H1r5x2X3a,#6H1r6x2X3a,#7H1r5x2X3a,#8Ax0X2H0;+0:1]-[#1X1x0H0,#6H1x0X3,#6X4x0H2,#6X4x0H3,#6H1r6x2X3,#6X4r6x2H2,#8x0X2H0:2]
Rejected!
 proposed SMIRKS changed the way fragments are clustered
------------------------------------------------------------------------------------------
Iteration:  1
Attempting to change SMIRKS #1
[#6r5,#6r6,#7r5;+0;H1;X3;a;x2:1]:;@[#6r5,#6r6,#7r5;+0;H1;X3;a;x2:2]  -->  [#6r5,#6r6,#7r5;+0;H1;X3;a;x2:1]:[#6r5,#6r6,#7r5;+0;H1;X3;a;x2:2]
Accepted! 
------------------------------------------------------------------------------------------
Iteration:  2
Attempting to change SMIRKS #2
[*!rx0,*r6x2;#6;+0;A;H1;X3:1]=[#6!rH1X3x0,#6H1X3r6x2,#8!rH0X1x0;+0;A:2]  -->  [#

In [7]:
print_smirks(smirks10)


 Label                | SMIRKS 
 zz_sing              | [#6!rAH1X3x0,#6!rAH2X4x0,#6!rAH3X4x0,#6AH1X3r6x2,#6AH2X4r6x2,#6H1X3ar5x2,#6H1X3ar6x2,#7H1X3ar5x2,#8!rAH0X2x0;+0:1]-[#1!rH0X1x0,#6!rH1X3x0,#6!rH2X4x0,#6!rH3X4x0,#6H1X3r6x2,#6H2X4r6x2,#8!rH0X2x0;+0;A:2] 
--------------------------------------------------------------------------------
 zz_aromatic          | [#6,#6,#7;x2;+0;X3;a:2]:[*;H1;+0;X3;a:1] 
--------------------------------------------------------------------------------
 zz_double            | [#6H1x0X3,#6H1r6x2X3,#8X1x0H0;+0;A:2]=[*;#6;+0;A;H1;X3:1] 
--------------------------------------------------------------------------------



### 3c. Continue removing decorators

Now we will continue trying to reduce the SMIRKS. Note, in this case we set verbose to False and just print the final SMIRKS since 3,000 is a lot of steps. 

In [8]:
smirks3k = bond_smirksifier.reduce(max_its=3000, verbose=False)
print_smirks(smirks3k)


 Label                | SMIRKS 
 zz_sing              | [*:1]~[*:2] 
--------------------------------------------------------------------------------
 zz_aromatic          | [*:1]:[*:2] 
--------------------------------------------------------------------------------
 zz_double            | [*:2]=[*:1] 
--------------------------------------------------------------------------------



## 4. What have we learned for the future 

In this section we make note of what we learned from this example and potential improvements for `chemper` in the near future. 

**1. Can we automatically determine the number of necessary layers?**

Currently the user has to set the number of layers, or how many bonds away from the indexed atoms should be included in the initial SMIRKS patterns. However, the point of `chemper` tools is to automatically determine the SMIRKS patterns. It would be preferable to have the number of layers determined automatically based on how many are actually necessary.

The answer here seems to be that of course we can: you start by typing with 0 layers and then systematically add layers until you get a 100% correspondence between the way molecules are typed with the automatically created SMIRKS and the way they were assigned to be clustered.
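A minimal sketch of that loop, assuming a caller-supplied check that reports whether the SMIRKS built with a given number of layers reproduce the assigned clustering. The function name `find_minimum_layers`, the `typing_is_reproduced` callback, and the `max_layers` cap are all hypothetical and not part of `chemper`; the check used here is a stand-in.

```python
def find_minimum_layers(typing_is_reproduced, max_layers=5):
    """Return the smallest layer count for which the check passes, else None."""
    for layers in range(max_layers + 1):
        if typing_is_reproduced(layers):
            return layers
    return None

# Stand-in check for illustration only: pretend two layers are required.
def fake_check(layers):
    return layers >= 2

needed_layers = find_minimum_layers(fake_check)
print(needed_layers)
```

In practice the callback would build the patterns with the given layer count and compare the resulting typing against the assigned clusters; a cap on the number of layers guards against clusterings that no amount of context can separate.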

**2. Is there a systematic way to remove decorators that doesn't introduce too much human wizardry?**

Right now the removal of decorators is stochastic, so there is no guarantee that the same SMIRKS will be created for the same clustering of atoms every time. This is because multiple combinations of decorators can maintain the same clustering.

We could consider looking for differences in the ClusterGraphs and start by removing any decorators that are in common for all SMIRKS since those are clearly not distinguishing features. However, it seems unlikely that a systematic removal wouldn't be biased by the choices of the human who chose the order for checking the removal of the decorators. 
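As a sketch of the "decorators in common" idea, the snippet below takes the AND-ed decorators on atom `:1` of the three initial patterns printed in section 3a and intersects them. It works on hand-copied sets of strings rather than on `ClusterGraph` objects, so it only illustrates the set logic, not how `chemper` would do it.

```python
# AND-ed decorators on atom :1, copied by hand from the initial SMIRKS in 3a.
and_decorators = {
    "zz_sing": {"+0"},
    "zz_aromatic": {"+0", "H1", "X3", "a", "x2"},
    "zz_double": {"#6", "+0", "A", "H1", "X3"},
}

# Decorators present in every pattern cannot distinguish between clusters.
shared = set.intersection(*and_decorators.values())

# What remains after dropping the shared decorators from each pattern.
distinguishing = {label: decs - shared for label, decs in and_decorators.items()}

print(sorted(shared))
for label, decs in distinguishing.items():
    print(label, sorted(decs))
```

Here only the formal charge `+0` is shared by all three patterns, so it would be the first candidate for removal; choosing an order for the remaining decorators is where the human bias mentioned above would enter.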