# How does the code work?


There are two different types of GRNN that can be used.

1. A model that learns from smiles fragments of equal length and predicts the next character of the sequence.
2. A model that learns from smiles fragments of different lengths and predicts the next character of the sequence.

## Model 1


This model can be used in multiple ways.

### First way

The model is instantiated with the smiles strings, which are stored in the class directly. To train it, pass the `fit` method the indices of the smiles strings that should be used for training.

In [1]:
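import sys
import os
# Make the local models directory importable (machine-specific path, adjust as needed)
sys.path.insert(0, os.path.abspath("/Volumes/Transcend/repositories/NovaData/models"))
import sklearn_models as sm

# Three example molecules and the indices of those to train on
smiles = ["CC(=O)NC(CS)C(=O)Oc1ccc(NC(C)=O)cc1", "COc1ccc2CC5C3C=CC(O)C4Oc1c2C34CCN5C", "O=C(C)Oc1ccccc1C(=O)O"]
idx = [0, 1, 2]

# The smiles are one-hot encoded and stored at instantiation; fit only receives indices
estimator = sm.Model_1(smiles=smiles, window_length=10)
estimator.fit(idx)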

Using TensorFlow backend.
  self.model.fit(X_hot, y_hot, batch_size=batch_size, verbose=1, nb_epoch=self.nb_epochs)


Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


What happens when the estimator is instantiated with smiles strings? 

1. The argument `smiles` is checked to make sure it is a list of strings.
2. The smiles strings are one-hot encoded. During this process, the following things are done:

    1. The unique characters present in the smiles strings are gathered and a dictionary is created that turns every character into an integer. This includes 'G', 'E' and 'A'.
    2. All molecules are modified so that they start with 'G' and end with 'E'.
    3. The smiles strings are then split into overlapping windows of length `window_length`. These constitute the 'X' part of the data set.
    4. For each window, the character that follows that window in the smiles string is also stored. These constitute the 'Y' part of the data set.
    5. Both the X and Y part of the data set are one-hot encoded and *stored* in the class.
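
The encoding steps above can be sketched in plain Python. This is only an illustration of the idea, not the code used in `sklearn_models`: the function names `make_windows` and `one_hot` are hypothetical, and the sketch uses nested lists where the real class presumably uses arrays.

```python
def make_windows(smiles, window_length):
    """Pad each molecule with 'G' and 'E', then cut it into overlapping windows."""
    padded = ["G" + s + "E" for s in smiles]
    # The text says 'A' is also part of the vocabulary; its role is not described here.
    chars = sorted(set("".join(padded)) | {"A"})
    char_to_idx = {c: i for i, c in enumerate(chars)}

    windows, next_chars = [], []
    for mol in padded:
        for start in range(len(mol) - window_length):
            windows.append(mol[start:start + window_length])   # 'X' part
            next_chars.append(mol[start + window_length])       # 'Y' part
    return windows, next_chars, char_to_idx


def one_hot(sequence, char_to_idx):
    """Turn a string into a list of one-hot vectors, one per character."""
    vectors = []
    for c in sequence:
        v = [0] * len(char_to_idx)
        v[char_to_idx[c]] = 1
        vectors.append(v)
    return vectors


# Hypothetical toy input: ethanol and benzene, with a short window
windows, next_chars, char_to_idx = make_windows(["CCO", "c1ccccc1"], window_length=3)
X_hot = [one_hot(w, char_to_idx) for w in windows]
y_hot = [one_hot(c, char_to_idx)[0] for c in next_chars]
```

Each entry of `X_hot` has one vector per window position, and each entry of `y_hot` is a single vector for the character that follows the window.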

What happens when the estimator is fit to the data?

1. The fit method receives the indices of the smiles strings to use for training.
2. There is a check to make sure that there is data (smiles strings) stored in the class. 
3. Another function converts the indices passed as an argument (which correspond to entire smiles strings) to the indices of the overlapping windows.
4. Then, the windows that are needed for training are extracted from the data set and used for the fit.
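
The conversion in step 3 can be pictured as follows. This is a minimal sketch under the assumption that the windows of all molecules are stored one after another in the order of the smiles list; the helper names `window_index_ranges` and `windows_for` are hypothetical and not taken from the class.

```python
def window_index_ranges(smiles, window_length):
    """For each molecule, return the range of window indices that belong to it."""
    ranges, start = [], 0
    for s in smiles:
        # +2 accounts for the added 'G' and 'E'; too-short molecules give no windows
        n_windows = max(len(s) + 2 - window_length, 0)
        ranges.append(range(start, start + n_windows))
        start += n_windows
    return ranges


def windows_for(idx, ranges):
    """Collect the window indices for the selected molecule indices."""
    return [w for i in idx for w in ranges[i]]


# Hypothetical toy input
ranges = window_index_ranges(["CCO", "c1ccccc1"], window_length=3)
selected = windows_for([1], ranges)
```

Here the first molecule owns windows 0 and 1, so selecting molecule 1 yields window indices 2 through 8.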

In [2]:
predictions = estimator.predict(X=idx)

What happens when the predict function is called?

1. The argument `X` of the predict function is checked. For this estimator it cannot be None. In addition, when this estimator is used in this way, only indices can be passed, not new smiles strings.
2. The smiles strings corresponding to the indices are extracted, along with the windows that belong to them.
3. For each *smiles string*, the first window is input into the model and the next character is predicted.
4. This character is then appended to the end of the initial window while the first character is dropped. 
5. The prediction and the window update in point 4 are repeated until either the character 'E' is produced or the predicted smiles string has reached a length of 100.
6. The first and last character ('G' and 'E') of the predicted smiles are removed and the smiles are returned.
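
The sliding-window generation loop described above might look roughly like this. It is a sketch, not the estimator's actual code: `generate` and `predict_next` are hypothetical names, and a lookup table stands in for the trained network. Removing 'E' only when it is present is an assumption made for the case where the length cap is hit first.

```python
def generate(first_window, predict_next, max_length=100):
    """Grow a smiles string one character at a time from an initial window."""
    smiles = first_window
    window = first_window
    while len(smiles) < max_length:
        next_char = predict_next(window)      # stand-in for the model's prediction
        smiles += next_char
        if next_char == "E":                  # end-of-molecule character
            break
        window = window[1:] + next_char       # drop first character, append new one
    # Remove the start and end markers (assumption: 'E' may be absent at the cap)
    if smiles.startswith("G"):
        smiles = smiles[1:]
    if smiles.endswith("E"):
        smiles = smiles[:-1]
    return smiles


# Toy predictor: a fixed table instead of a trained model
table = {"GCC": "O", "CCO": "E"}
result = generate("GCC", table.get)
```

With this toy table, the first window "GCC" is extended by "O" and then by "E", so `result` is "CCO".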