# How does the code work?


There are two different types of GRNN that can be used.

1. A model that learns from smiles fragments of equal length and predicts the next character of the sequence.
2. A model that learns from smiles fragments of different lengths and predicts the next character of the sequence.

## Model 1


This model can be used in 2 ways.

### First way

The model is instantiated and the smiles strings are stored in the class directly. Then, one needs to pass the indices of the smiles strings that should be used for training. 

In [1]:
import sys
import os
sys.path.insert(0, os.path.abspath("/Volumes/Transcend/repositories/NovaData/models"))
import sklearn_models as sm

smiles = ["CC(=O)NC(CS)C(=O)Oc1ccc(NC(C)=O)cc1", "COc1ccc2CC5C3C=CC(O)C4Oc1c2C34CCN5C", "O=C(C)Oc1ccccc1C(=O)O"]
idx = [0, 1, 2]

estimator = sm.Model_1(smiles=smiles, window_length=10)
estimator.fit(idx)

Using TensorFlow backend.
  self.model.fit(X_hot, y_hot, batch_size=batch_size, verbose=1, nb_epoch=self.nb_epochs)


Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


What happens when the estimator is instantiated with smiles strings? 

1. The argument `smiles` is checked to make sure it is a list of strings.
2. The smiles strings are one-hot encoded. During this process, the following things are done:

    1. The unique characters present in the smiles strings are gathered and a dictionary is created that turns every character into an integer. This includes 'G', 'E' and 'A'.
    2. All molecules are modified so that they start with 'G' and end with 'E'.
    3. The smiles strings are then split into overlapping windows of length `window_length`. These constitute the 'X' part of the data set.
    4. For each window, the character that follows that window in the smiles string is also stored. These constitute the 'Y' part of the data set.
    5. Both the X and Y part of the data set are one-hot encoded and *stored* in the class.
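The encoding steps above can be sketched in plain Python. This is only an illustration of the idea, not the code in `sklearn_models`: the helper names `build_char_map`, `make_windows` and `one_hot` are hypothetical, and the real class may order the characters or store the arrays differently.

```python
def build_char_map(smiles):
    """Map every unique character, plus 'G', 'E' and 'A', to an integer."""
    chars = sorted(set("".join(smiles)) | {"G", "E", "A"})
    return {c: i for i, c in enumerate(chars)}


def make_windows(smiles, window_length):
    """Split each 'G' + smiles + 'E' string into overlapping windows (X)
    and the character that follows each window (y)."""
    X, y = [], []
    for s in smiles:
        tagged = "G" + s + "E"
        for start in range(len(tagged) - window_length):
            X.append(tagged[start:start + window_length])
            y.append(tagged[start + window_length])
    return X, y


def one_hot(text, char_to_idx):
    """One-hot encode a string as a list of 0/1 rows, one row per character."""
    n = len(char_to_idx)
    return [[1 if char_to_idx[c] == j else 0 for j in range(n)] for c in text]


smiles = ["O=C(C)Oc1ccccc1C(=O)O"]
char_to_idx = build_char_map(smiles)
X, y = make_windows(smiles, window_length=10)
X_hot = [one_hot(w, char_to_idx) for w in X]        # one matrix per window
y_hot = [one_hot(c, char_to_idx)[0] for c in y]     # one row per target character
```

Here the first window is `'GO=C(C)Oc1'` and its target is the character `'c'` that follows it; the last target is the end marker `'E'`.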

What happens when the estimator is fit to the data?

1. The fit method receives the indices of the smiles strings to use for training.
2. There is a check to make sure that there is data (smiles strings) stored in the class. 
3. Another function converts the indices passed as an argument (which correspond to entire smiles strings) to the indices of the overlapping windows.
4. Then, the windows that are needed for training are extracted from the data set and used for the fit.
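The conversion from smiles indices to window indices in step 3 might look like the sketch below. It assumes that the windows of each sample are stored one sample after another and that every sample is counted with its 'G' and 'E' markers; `window_indices` is a hypothetical name, not a function of the library.

```python
def window_indices(smiles, sample_idx, window_length):
    """Return the indices of the windows that belong to the requested samples."""
    # Number of usable windows per sample ('G' and 'E' add 2 characters).
    counts = [max(len(s) + 2 - window_length, 0) for s in smiles]
    # Position of the first window of each sample in the flat list of windows.
    offsets = [0]
    for c in counts[:-1]:
        offsets.append(offsets[-1] + c)
    selected = []
    for i in sample_idx:
        selected.extend(range(offsets[i], offsets[i] + counts[i]))
    return selected


smiles = ["CCCCCCCCCC", "CCCCCCCCCCCC", "CCCCCCCCCCC"]  # lengths 10, 12 and 11
idx = window_indices(smiles, sample_idx=[0, 2], window_length=10)
print(idx)  # [0, 1, 6, 7, 8]
```

The three samples contribute 2, 4 and 3 windows, so samples 0 and 2 own windows 0 to 1 and 6 to 8 respectively.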

In [2]:
predictions = estimator.predict(X=idx)

What happens when the predict function is called?

1. The argument `X` of the predict function is checked. For this estimator it cannot be None. In addition, when this estimator is used in this way, only indices can be passed, not new smiles strings.
2. The smiles corresponding to the indices are extracted and so are the windows corresponding to those smiles.
3. For each *smiles*, the first window is input into the model and the next character is predicted.
4. This character is then appended to the end of the initial window while the first character is dropped. 
5. Point 4 is repeated until either the character 'E' is produced or until the predicted smiles string has reached a length of 100.
6. The first and last character ('G' and 'E') of the predicted smiles are removed and the smiles are returned.
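The prediction loop described above can be sketched as follows. The trained network is replaced by a hypothetical stand-in, `toy_predictor`, that maps a window to the next character, and `generate` is an illustrative name. Whether the length limit of 100 counts the 'G' marker is an assumption made here.

```python
def generate(first_window, predict_next, max_length=100):
    """Grow a smiles string one character at a time from an initial window."""
    window = first_window
    generated = first_window
    while len(generated) < max_length:
        next_char = predict_next(window)
        generated += next_char
        if next_char == "E":
            break
        # Slide the window: drop the first character, append the new one.
        window = window[1:] + next_char
    # Remove the 'G' start marker and the 'E' end marker, if present.
    if generated.startswith("G"):
        generated = generated[1:]
    if generated.endswith("E"):
        generated = generated[:-1]
    return generated


def toy_predictor(window):
    """Stand-in for the trained model: emit 'c' three times, then 'E'."""
    return "E" if window.endswith("ccc") else "c"


result = generate("GCC(=O)NC(", toy_predictor)
print(result)  # CC(=O)NC(ccc
```

The window always keeps the same length, so the model only ever sees the most recent `window_length` characters.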

### Second way

The estimator is instantiated but no smiles strings are given. The smiles are passed to the fit function.

In [3]:
estimator = sm.Model_1(window_length=10)
estimator.fit(smiles)

  self.model.fit(X_hot, y_hot, batch_size=batch_size, verbose=1, nb_epoch=self.nb_epochs)


Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


What happens when the model is instantiated without smiles strings?

1. All the hyper-parameters are set in the same way as when smiles are given.
2. The variable `estimator.smiles` is set to `None`.

What happens in the fit function?

1. The data is passed to a function that checks whether `estimator.smiles` contains data.
2. The data passed is checked to make sure that it is a list of strings.
3. The smiles are one-hot encoded:
    1. A list of all the unique characters present in the given smiles is created.
    2. Each unique character in this list is assigned an index, and a dictionary mapping each character to its index is built.
    3. The smiles have 'G' and 'E' added to the extremities.
    4. The smiles strings are then split into overlapping windows of length `window_length`. These constitute the 'X' part of the data set.
    5. For each window, the character that follows that window in the smiles string is also stored. These constitute the 'Y' part of the data set.
    6. Both the X and Y part of the data set are one-hot encoded.
4. The one-hot encoded X and Y parts of the data are used for the fitting.

In [4]:
predictions = estimator.predict(smiles)

What happens in the predict function?

1. A function checks that smiles have been passed to the predict function.
2. In the one-hot encoding function, the smiles strings have 'G' and 'E' added to the extremities. They are then split into windows and each window is one-hot encoded.
3. The smiles strings that are not one-hot encoded also have 'G' and 'E' added to the extremities, because they are needed in the actual prediction part.
4. At prediction time, the first one-hot encoded window is taken and the next character is predicted.
5. The character is appended to the window and the first character is removed.
6. Step 5 is repeated until either the character 'E' is produced or until the smiles reaches length 100.
7. The 'G' and 'E' characters are removed.

*Note on the indices of the windows*:
Normally, the number of windows in a string of length `sample_length` is:
`sample_length - window_length + 1`
However, the last of these windows has no following character to predict, so it cannot be used. This leaves
`sample_length - window_length`
windows.
When predicting, we want to start from the first window of each sample. Since each sample has `sample_length - window_length` windows, their indices run from 0 to `sample_length - window_length - 1` relative to that sample's first window. So, the first window of the next sample is `sample_length - window_length` positions further along in `X_hot`.
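A small worked example makes the two counts concrete. It assumes that `sample_length` includes the 'G' and 'E' markers, and it uses a short window of length 3 so that every window can be read at a glance.

```python
sample = "GCCOCCNE"   # sample_length = 8, including 'G' and 'E'
window_length = 3

# Every window of length 3, including the last one, which ends on 'E'.
all_windows = [sample[i:i + window_length]
               for i in range(len(sample) - window_length + 1)]

# Only windows that are followed by a character can form an (X, y) pair.
pairs = [(sample[i:i + window_length], sample[i + window_length])
         for i in range(len(sample) - window_length)]

print(len(all_windows))  # 6 = sample_length - window_length + 1
print(len(pairs))        # 5 = sample_length - window_length
print(pairs[-1])         # ('CCN', 'E')
```

The window `'CNE'` is dropped because nothing follows it, so this sample occupies indices 0 to 4 and the next sample would start at index 5.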

## Model 2

This model can also be used in 2 ways.

### First way

As with the first model, one can instantiate the model and store the smiles strings directly in the class. Then, one needs to pass the indices of the smiles strings that should be used for training.

In [5]:
estimator = sm.Model_2(smiles=smiles)
estimator.fit(idx)



  self.model.fit(X_hot, y_hot, batch_size=batch_size, verbose=1, nb_epoch=self.nb_epochs)


Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


What happens when the model is instantiated with smiles strings?

1. The smiles are validated (as for Model 1, they must be a list of strings).
2. The smiles are one-hot encoded:
    1. 'G' is added to the start and 'E' to the end of every smiles string.
    2. The unique characters present in the smiles are extracted and the length of the longest smiles string is recorded (including the 'G' and 'E' characters).
    3. The molecules that are shorter than the maximum length are padded with 'A' characters after the 'E', so that all of the smiles have the same length.
    4. The padded smiles are one-hot encoded and stored in the class.
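The tagging and padding steps above amount to the following sketch, where `pad_smiles` is a hypothetical helper name used only for illustration:

```python
def pad_smiles(smiles):
    """Add 'G'/'E' markers, then pad with 'A' up to the longest tagged string."""
    tagged = ["G" + s + "E" for s in smiles]
    max_length = max(len(s) for s in tagged)
    return [s.ljust(max_length, "A") for s in tagged]


padded = pad_smiles(["CCO", "c1ccccc1", "C"])
print(padded)  # ['GCCOEAAAAA', 'Gc1ccccc1E', 'GCEAAAAAAA']
```

Because the padding comes after 'E', the model can still learn where each molecule ends.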
    
What happens when the fit function is called?

1. A check confirms that data has already been stored in the class, and the indices are validated.
2. The required one-hot encoded smiles strings are extracted and passed on for training. These constitute the X part of the data.
3. The one-hot encoded smiles are shifted by one position (if the smiles string was 'GCCCE' it becomes 'CCCE') and the last position is left empty (all zeros in the one-hot encoded version). This is the Y part of the data.
4. The X and the Y part of the data are used for training.
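The shift that produces the Y part can be illustrated on a single one-hot encoded string. This is a plain-Python sketch with nested lists; the library presumably works on arrays, and `shift_targets` is a hypothetical name.

```python
def shift_targets(x_hot):
    """Y is X shifted left by one time step, with an all-zero final row."""
    n_features = len(x_hot[0])
    return x_hot[1:] + [[0] * n_features]


chars = ["G", "C", "E"]
char_to_idx = {c: i for i, c in enumerate(chars)}
x_hot = [[1 if char_to_idx[c] == j else 0 for j in range(len(chars))]
         for c in "GCCCE"]
y_hot = shift_targets(x_hot)

print(y_hot[0])   # [0, 1, 0]  -> 'C', the character that follows 'G'
print(y_hot[-1])  # [0, 0, 0]  -> nothing follows 'E'
```

At each position the target is therefore the next character of the same string, which is what lets the model train on sequences of different effective lengths.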

In [8]:
estimator.predict(idx, frag_length=10)

['CC(=O)NC(Ccccc', 'COc1ccc2CCc', 'O=C(C)Oc1cccc']

In [7]:
estimator.predict()

['ccccccc']

The predict function can be called in two ways:

1. Indices of stored smiles strings are passed: In this case the predictions are made from fragments of existing smiles. These fragments have length `frag_length`. Prediction from these fragments continues until either the 'E' character is produced or the smiles reaches length 100. Then, 'G', 'E' and 'A' characters are removed.
2. Nothing is passed as an argument: In this case only 1 prediction is made. The prediction starts from 'G' alone and continues until either the 'E' character is produced or the smiles reaches length 100. Then, 'G', 'E' and 'A' characters are removed.
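Both calling modes follow the same loop, which can be sketched as below. The trained network is replaced by a hypothetical `toy_predictor` that sees the whole sequence generated so far, and `generate_from` is an illustrative name. Taking the fragment as the first `frag_length` characters of the smiles string is an assumption.

```python
def generate_from(fragment, predict_next, max_length=100):
    """Continue a fragment (or start from 'G' alone) until 'E' or max_length."""
    text = "G" + fragment  # an empty fragment means generation starts from 'G'
    while len(text) < max_length:
        next_char = predict_next(text)
        text += next_char
        if next_char == "E":
            break
    # Remove the start, end and padding characters from the result.
    return "".join(c for c in text if c not in "GEA")


def toy_predictor(text):
    """Stand-in for the trained model: emit 'c' until two in a row, then 'E'."""
    return "E" if text.endswith("cc") else "c"


frag_length = 4
from_fragment = generate_from("COc1ccc2CC"[:frag_length], toy_predictor)
from_scratch = generate_from("", toy_predictor)
print(from_fragment)  # COc1cc
print(from_scratch)   # cc
```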

### Second way

As with the first model, the estimator is instantiated but no smiles strings are given. The smiles are passed to the fit function.

In [9]:
estimator = sm.Model_2()
estimator.fit(smiles)



  self.model.fit(X_hot, y_hot, batch_size=batch_size, verbose=1, nb_epoch=self.nb_epochs)


Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


What happens when the model is instantiated?

1. The variable `estimator.smiles` is set to `None`.
2. All the other hyper-parameters are set normally.

What happens when the model is fit to the data?

1. The smiles strings are checked to make sure they are a list of strings.
2. The smiles strings are one-hot encoded:
    1. The unique characters are extracted and the length of the longest smiles string is recorded.
    2. The smiles have 'G' added to the start and 'E' to the end, and are padded with 'A' so that they all have the same length.
    3. The smiles are then one-hot encoded.
3. The one-hot encoded smiles are used for training.

In [10]:
estimator.predict(smiles, frag_length=10)

['CC(=O)NC(CS)C(=O)Oc1ccc(NC(C)=O)cc1',
 'COc1ccc2CC5C3C=CC(O)C4Oc1c2C34CCN5C',
 'O=C(C)Oc1ccccc1C(=O)O']

In [11]:
estimator.predict()

['cccccc']

Again, the predict function can be called in two ways:

1. Smiles strings are passed:
    1. The first fragment of each new smiles string is one-hot encoded using the dictionary of unique characters and indices generated in the fit function. In the one-hot encoding function, a 'G' is added to the beginning of the fragment.
    2. The predictions are made from fragments of the given smiles. These fragments have length `frag_length`. Prediction from these fragments continues until either the 'E' character is produced or the smiles reaches length 100. Then, 'G', 'E' and 'A' characters are removed.
2. Nothing is passed as an argument: In this case only 1 prediction is made. The prediction starts from 'G' alone and continues until either the 'E' character is produced or the smiles reaches length 100. Then, 'G', 'E' and 'A' characters are removed.