Setting up a NN Workflow
======

This is where I take the puzzle pieces I learned from `deep_learning_keras.ipynb` and apply them
toward deploying neural networks (NNs) in our code base.  I am using a notebook because I still want the visuals for explaining
the process and seeing the outputs.  Once the code is ironed out, we will move it to the `core` codebase.
***
## Objectives
1. Import data and featurize with RDKit and/or ECFP.
2. Create functions for defining basic neural networks with minimal manual repetition.
3. Set up procedures to optimize the model's hyperparameters using Bayesian methods.
4. Implement callbacks for saving the model and rolling back to the best version.
5. Visualize the results of the model using `matplotlib` and TensorBoard.
6. Export stats and results similarly to how we have done it in the past.

In [1]:
import tensorflow as tf
from tensorflow import keras
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

### Function to define neural network
Use the Sequential API to build architectures with a rectangular or pyramid shape.  Pyramids are wide at the input and narrow
toward the output.

How can I handle variable input shapes?  I think the shape really needs to be defined up front.  If I make it a required
argument, how does that affect wrapping the model in an sklearn-style estimator?
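One possible answer to the input-shape question is to defer building the network until `fit` is called, so the shape can be read from the data. The sketch below is only an illustration under assumptions: `DeferredRegressor` is a hypothetical class name, `build_fn` stands for any builder that accepts an `in_shape` keyword (such as the `build_nn` defined in the next cell), and the class only mimics the `fit`/`predict` interface. A real sklearn integration would likely need to subclass `sklearn.base.BaseEstimator` and expose its parameters explicitly.

```python
class DeferredRegressor:
    """Hypothetical sklearn-style wrapper that builds the model on first fit."""

    def __init__(self, build_fn, **build_params):
        self.build_fn = build_fn          # callable accepting in_shape=... (assumption)
        self.build_params = build_params  # extra keyword arguments for the builder
        self.model_ = None

    def fit(self, X, y, **fit_params):
        # Infer the per-sample input shape from the data by dropping the batch axis.
        in_shape = tuple(X.shape[1:])
        self.model_ = self.build_fn(in_shape=in_shape, **self.build_params)
        self.model_.fit(X, y, **fit_params)
        return self

    def predict(self, X):
        return self.model_.predict(X)
```

With this pattern the caller never passes the shape by hand; something like `DeferredRegressor(build_nn, n_hidden=2).fit(train_features, train_target)` would pick it up from the feature matrix.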

In [2]:
def build_nn(n_hidden=2, n_nueron=50, learning_rate=1e-3, in_shape=(8,)):
    """Build and compile a fully connected regression network."""
    model = keras.models.Sequential()
    model.add(keras.layers.InputLayer(input_shape=in_shape))  # per-sample shape, e.g. X.shape[1:]
    for _ in range(n_hidden):  # create hidden layers
        model.add(keras.layers.Dense(n_nueron, activation="relu"))
    model.add(keras.layers.Dense(1))  # output layer: a single regression value
    # The optimizer is a point to vary.  A dict could help call other ones.
    # Only one optimizer should be active; the original SGD line was overwritten by Adam.
    # optimizer = keras.optimizers.SGD(learning_rate=learning_rate)
    optimizer = keras.optimizers.Adam(learning_rate=learning_rate)
    # optimizer = keras.optimizers.RMSprop(learning_rate=learning_rate)
    model.compile(loss="mse", optimizer=optimizer)
    return model

Some thoughts on the above function.  I could add an option for specifying the overall shape, for example `"rect"` or
`"triangle"`.  This would control how the number of neurons tapers as the layers get deeper.

It would be nice to be able to set the activation function from outside the function, perhaps through a dictionary lookup.
Similarly, it would be nice to treat the choice of optimizer as a tunable option while retaining the granular tuning provided
by calling the specific `keras.optimizers` class.  Again, a dictionary that maps names to callables could work well here.
One challenge is passing parameters to the selected optimizer.  I think most accept a learning rate, but beyond
that, how would we automatically tune optimizer-specific parameters?  That may be too advanced for this scope of work.
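As a rough sketch of both ideas, the snippet below computes hidden-layer widths for a `"rect"` or `"triangle"` shape and looks up an optimizer by name in a dictionary. The helper names `layer_widths` and `make_optimizer`, the `min_neuron` parameter, and the linear taper are all assumptions made for illustration; none of them exist in `core`. The optimizer lookup imports Keras lazily so that the width helper can be tried without TensorFlow installed.

```python
def layer_widths(n_hidden, n_neuron, shape="rect", min_neuron=8):
    """Return the number of neurons for each hidden layer (hypothetical helper)."""
    if n_hidden < 1:
        raise ValueError("n_hidden must be at least 1")
    if shape == "rect":
        # Every hidden layer has the same width.
        return [n_neuron] * n_hidden
    if shape == "triangle":
        if n_hidden == 1:
            return [n_neuron]
        # Taper linearly from n_neuron at the input side to min_neuron at the output side.
        step = (n_neuron - min_neuron) / (n_hidden - 1)
        return [round(n_neuron - i * step) for i in range(n_hidden)]
    raise ValueError(f"unknown shape: {shape!r}")


def make_optimizer(name, learning_rate=1e-3, **kwargs):
    """Look up a Keras optimizer class by name and instantiate it (hypothetical helper)."""
    from tensorflow import keras  # lazy import; only needed when this function is called

    optimizers = {
        "sgd": keras.optimizers.SGD,
        "adam": keras.optimizers.Adam,
        "rmsprop": keras.optimizers.RMSprop,
    }
    # Optimizer-specific settings (e.g. momentum for SGD) pass through **kwargs.
    return optimizers[name](learning_rate=learning_rate, **kwargs)


rect = layer_widths(3, 300, shape="rect")
triangle = layer_widths(4, 300, shape="triangle", min_neuron=30)
```

Inside `build_nn`, the loop over hidden layers could then iterate over the returned widths instead of reusing one fixed neuron count. Passing optimizer-specific settings through `**kwargs` keeps the granular control, though it does not answer how to tune them automatically.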

### Import and process data
Use existing processes to featurize and split some data for use with the models.



In [3]:
from core.ingest import load_smiles
from core.features import targets_features, featurize
from core.misc import cd

Load the data using our `ingest.py`.

In [4]:
# specific to 18k-logP
data = {'18k-logP.csv':'logp'}
dataset = '18k-logP.csv'

with cd('../dataFiles/'): # move to dataset directory
    df, exp = load_smiles(dataset, data[dataset])

df.head(5)

Unnamed: 0,smiles,logp
0,Cn1c(CN2CCN(CC2)c3ccc(Cl)cc3)nc4ccccc14,3.54
1,COc1cc(OC)c(cc1NC(=O)CSCC(=O)O)S(=O)(=O)N2C(C)...,-1.18
2,COC(=O)[C@@H](N1CCc2sccc2C1)c3ccccc3Cl,3.69
3,OC[C@H](O)CN1C(=O)C(Cc2ccccc12)NC(=O)c3cc4cc(C...,3.37
4,Cc1cccc(C[C@H](NC(=O)c2cc(nn2C)C(C)(C)C)C(=O)N...,3.1


Featurize the dataset using our `features.featurize()`.  I don't think it really matters at this point what algorithm
name gets passed.  I will handle the normalization/scaling manually.

In [5]:
df, num_feat, feat_time = featurize(df, 'rf', [0])

You have selected the following featurizations:    rdkit2d
Calculating features... Done.


In [6]:
df = df.drop("RDKit2D_calculated", axis=1)
df.head(5)

Unnamed: 0,smiles,logp,BalabanJ,BertzCT,Chi0,Chi0n,Chi0v,Chi1,Chi1n,Chi1v,...,fr_sulfonamd,fr_sulfone,fr_term_acetylene,fr_tetrazole,fr_thiazole,fr_thiocyan,fr_thiophene,fr_unbrch_alkane,fr_urea,qed
0,Cn1c(CN2CCN(CC2)c3ccc(Cl)cc3)nc4ccccc14,3.54,1.420544,832.199002,16.518297,13.821155,14.577084,11.70351,8.33764,8.715604,...,0,0,0,0,0,0,0,0,0,0.728444
1,COc1cc(OC)c(cc1NC(=O)CSCC(=O)O)S(=O)(=O)N2C(C)...,-1.18,2.020016,1151.4285,24.172998,18.530293,20.163286,15.683108,10.164097,12.758861,...,1,0,0,0,0,0,0,0,0,0.545587
2,COC(=O)[C@@H](N1CCc2sccc2C1)c3ccccc3Cl,3.69,1.943224,655.231463,14.819626,11.712695,13.285121,10.202709,6.819775,8.077393,...,0,0,0,0,0,0,1,0,0,0.807761
3,OC[C@H](O)CN1C(=O)C(Cc2ccccc12)NC(=O)c3cc4cc(C...,3.37,1.572408,1015.409752,19.836134,14.684473,16.256898,13.456729,8.731046,9.925507,...,0,0,0,0,0,0,1,0,0,0.50665
4,Cc1cccc(C[C@H](NC(=O)c2cc(nn2C)C(C)(C)C)C(=O)N...,3.1,2.236144,902.250256,20.896977,17.036453,17.036453,13.112411,9.171258,9.171258,...,0,0,0,0,0,0,0,0,0,0.747686


Split the data.  I am unsure whether we need a dedicated validation set, because the goal is to use cross validation (CV),
which rotates the validation role through folds of the training data instead of setting one set aside.  However, I will
keep the validation set for now for troubleshooting purposes.
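To make the CV idea concrete, here is a dependency-free sketch of how k-fold splitting assigns samples to folds. It is written in plain Python only so it runs anywhere; `kfold_indices` is a hypothetical helper, and in practice `sklearn.model_selection.KFold` would be the natural choice. The folds here are contiguous and unshuffled, which is an assumption of this sketch.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, val_idx) index lists for contiguous, unshuffled folds."""
    # Spread any remainder over the first folds so fold sizes differ by at most one.
    fold_sizes = [
        n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
        for i in range(n_splits)
    ]
    start = 0
    for size in fold_sizes:
        stop = start + size
        val_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, val_idx
        start = stop


folds = list(kfold_indices(n_samples=10, n_splits=3))
for train_idx, val_idx in folds:
    print("train:", train_idx, "validate:", val_idx)
```

Each sample is used for validation exactly once, which is why no separate validation set has to be held out.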

In [7]:
# split up the data using 20% for testing
train_features_full, test_features, train_target_full, test_target, feature_list = targets_features(df, data[dataset])

# get validation data from training data (train_test_split holds out 25% by default)
train_features, val_features, train_target, val_target = train_test_split(train_features_full, train_target_full)

Scale the data using sklearn's `StandardScaler()`.

In [8]:
# scale feature vectors
scaler = StandardScaler()
train_features = scaler.fit_transform(train_features)
val_features = scaler.transform(val_features)
test_features = scaler.transform(test_features)
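The scaler is fit on the training features only, and the same mean and standard deviation are then reused for the validation and test sets. The toy example below shows that arithmetic for a single feature column using only the standard library. The helper names and the numbers are made up for illustration; `pstdev` is used because `StandardScaler` divides by the population standard deviation.

```python
from statistics import fmean, pstdev


def fit_column(column):
    """Return the (mean, population std) of one feature column."""
    return fmean(column), pstdev(column)


def transform_column(column, mean, std):
    """Standardize a column with previously fitted statistics."""
    return [(x - mean) / std for x in column]


train_col = [1.0, 2.0, 3.0, 4.0, 5.0]  # made-up training values
val_col = [3.0, 6.0]                   # made-up validation values

mean, std = fit_column(train_col)      # statistics come from the training data only
train_scaled = transform_column(train_col, mean, std)
val_scaled = transform_column(val_col, mean, std)
```

The scaled training column has zero mean and unit variance by construction. The validation column generally does not, and that is expected: it is only being placed on the training scale.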

In [9]:
train_features

array([[ 5.18823133e-01, -7.35738823e-01, -7.19109954e-01, ...,
        -2.30940863e-01, -2.19004923e-01,  1.54082284e+00],
       [-1.09342893e-01, -1.67709926e-01, -1.78070889e-01, ...,
        -2.30940863e-01, -2.19004923e-01, -7.97531612e-01],
       [ 7.98740772e-01, -6.68581362e-01, -1.19313436e+00, ...,
        -2.30940863e-01, -2.19004923e-01, -4.53744954e-02],
       ...,
       [-7.46803177e-01,  1.89667914e+00,  8.14169883e-01, ...,
        -2.30940863e-01, -2.19004923e-01, -2.55714144e-01],
       [ 6.74171762e-01, -6.63235587e-01, -5.75388122e-01, ...,
        -2.30940863e-01,  4.08913478e+00,  1.09026939e+00],
       [ 2.77408859e-01, -3.77320805e-03, -2.94850484e-01, ...,
        -2.30940863e-01, -2.19004923e-01,  1.45963097e+00]])

In [10]:
val_features


array([[-0.25038877, -1.63079988, -1.66437228, ..., -0.23094086,
        -0.21900492, -0.95265462],
       [ 0.46991987,  0.29314979,  1.34696987, ..., -0.23094086,
         4.08913478, -1.86757004],
       [ 0.38429601,  0.38271791, -0.50837346, ..., -0.23094086,
        -0.21900492, -0.10171211],
       ...,
       [-1.15343485,  0.47190339,  0.19230411, ..., -0.23094086,
        -0.21900492, -1.09598931],
       [-0.88809887,  1.73929472,  1.89212848, ..., -0.23094086,
        -0.21900492, -1.16177978],
       [-0.49936002, -0.06251475, -0.01431285, ..., -0.23094086,
        -0.21900492,  1.31186832]])

### Deploy the model and data

The input shape is the part that I find a bit tricky, but let's see how it goes.

In [11]:
model = build_nn(n_hidden=50, n_nueron=300, in_shape=train_features.shape[1:], learning_rate=0.001)

# set a checkpoint file to save the model
chkpt_cb = keras.callbacks.ModelCheckpoint('test.h5', save_best_only=True)
# set up early stopping callback to avoid wasted resources
stop_cb = keras.callbacks.EarlyStopping(patience=10,  # number of epochs to wait for progress
                                        restore_best_weights=True)
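To illustrate what `patience` means, the sketch below replays a list of validation losses and reports where training would stop and which epoch was best. It is a simplified model of the patience rule, not Keras's actual implementation, and the loss traces are hypothetical.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch) as 0-based indices into val_losses."""
    best = float("inf")
    best_epoch = None
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Ran out of epochs before patience was exhausted.
    return len(val_losses) - 1, best_epoch


# Hypothetical trace: improves for three epochs, then stalls.
short = early_stopping_epoch([5.0, 4.0, 3.0, 3.5, 3.2, 3.1, 2.0], patience=2)

# Hypothetical trace: improves for nine epochs, then plateaus at a worse value.
trace = [10.0 - i for i in range(9)] + [5.0] * 20
long = early_stopping_epoch(trace, patience=10)
```

With `restore_best_weights=True`, the model ends up with the weights from the best epoch instead of the last one, so the epochs spent waiting cost time but do not degrade the final model.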

In [12]:
history = model.fit(train_features, train_target, epochs=50,
                    validation_data=(val_features,val_target),
                    callbacks=[chkpt_cb, stop_cb])

mse_test = model.evaluate(test_features, test_target)
print('\nTest MSE {:.4f}'.format(mse_test))

Train on 11016 samples, validate on 3673 samples
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50

Test MSE 2.8814
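Because MSE is in squared target units, taking the square root gives an error on the logP scale that is easier to interpret. A minimal conversion, using the value printed above:

```python
import math

mse_test = 2.8814                # test MSE reported by model.evaluate above
rmse_test = math.sqrt(mse_test)  # root-mean-square error, in logP units
print("Test RMSE {:.4f}".format(rmse_test))
```

This works out to an RMSE of roughly 1.70 logP units.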
