# Tutorial 1: Multi-Layer Perceptron with Keras

## Objectives

In this tutorial you will learn how to construct a simple Multi-Layer Perceptron model with Keras. Specifically you will learn to:
* Create and add layers, including weight initialization and activation.
* Compile models, including the optimization method, loss function and metrics.
* Fit models, including the number of epochs and the batch size.
* Make predictions with the model.
* Summarize the model.

In [1]:
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.model_selection import cross_val_score, KFold, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem, Descriptors

Using TensorFlow backend.


#### Reading molecules and activity from SDF

In [2]:
fname = "data/cdk2.sdf"

mols = []
y = []
for mol in Chem.SDMolSupplier(fname):
    if mol is not None:
        mols.append(mol)
        y.append(float(mol.GetProp("pIC50")))

#### Calculate descriptors (fingerprints) and convert them into numpy array

In [3]:
# generate binary Morgan fingerprint with radius 2
fp = [AllChem.GetMorganFingerprintAsBitVect(m, 2) for m in mols]

In [4]:
def rdkit_numpy_convert(fp):
    output = []
    for f in fp:
        arr = np.zeros((1,))
        DataStructs.ConvertToNumpyArray(f, arr)
        output.append(arr)
    return np.asarray(output)

In [5]:
x = rdkit_numpy_convert(fp)

In [6]:
# fix random seed for reproducibility
seed = 2019
np.random.seed(seed)

# randomly select 20% of compounds as test set
x_tr, x_ts, y_tr, y_ts = train_test_split(x, y, test_size=0.20, random_state=seed)

In [7]:
mol_num, feat_num = x_tr.shape
print("# molecules for training = %i, # of features = %i\n" % (mol_num, feat_num))

# molecules for training = 348, # of features = 2048



We can create Keras models and evaluate them with scikit-learn by using handy wrapper objects provided by the Keras library. This is desirable, because scikit-learn excels at evaluating models and will allow us to use powerful data preparation and model evaluation schemes with very few lines of code.

The Keras wrappers require a function as an argument. This function, which we must define ourselves, is responsible for creating the neural network model to be evaluated.

Below we define a function that creates a simple MLP regressor with a single fully connected hidden layer. The hidden layer has `sample_num` neurons (set to the number of training molecules, 348, in the call below) and receives `feat_num` inputs (the 2048 fingerprint bits). The network uses the rectifier (ReLU) activation function for the hidden layer. No activation function is used for the output layer because this is a regression problem and we want to predict numerical values directly, without any transformation.
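To make the architecture concrete, here is a minimal, dependency-free sketch of the forward pass of such a network: one ReLU hidden layer followed by a single linear output. The helper names (`relu`, `dense`, `mlp_forward`), the toy layer sizes, and the weight scale are assumptions chosen for illustration; this is not how Keras computes it internally.

```python
import random

def relu(values):
    """Rectifier activation: negative values are clipped to 0."""
    return [max(0.0, v) for v in values]

def dense(x, weights, bias):
    """Fully connected layer: one weighted sum (plus bias) per output neuron."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def mlp_forward(x, w_hidden, b_hidden, w_out, b_out):
    hidden = relu(dense(x, w_hidden, b_hidden))  # hidden layer with ReLU
    return dense(hidden, w_out, b_out)[0]        # linear output, no activation

# Toy sizes (hypothetical): 4 input bits, 3 hidden neurons, 1 output
rng = random.Random(2019)
n_in, n_hidden = 4, 3
w_hidden = [[rng.gauss(0, 0.05) for _ in range(n_in)] for _ in range(n_hidden)]
b_hidden = [0.0] * n_hidden
w_out = [[rng.gauss(0, 0.05) for _ in range(n_hidden)]]
b_out = [0.0]

x = [1, 0, 1, 1]  # a made-up 4-bit "fingerprint"
print(mlp_forward(x, w_hidden, b_hidden, w_out, b_out))
```

Because the output layer applies no activation, the prediction can take any real value, which is what a regression target requires.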

The efficient Adam optimization algorithm is used, and a mean squared error (MSE) loss function is minimized. This is the same metric that we will use to evaluate the performance of the model. It is a convenient metric because taking its square root gives an error value that we can interpret directly in the units of the target (pIC50).
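As a small worked example of the relationship between the mean squared error and its square root, the sketch below computes both for a handful of made-up observed and predicted values. The numbers are invented for illustration and are not taken from the CDK2 data set; the target here is assumed to be pIC50, as read from the SDF file earlier.

```python
import math

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between observed and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up pIC50 values, for illustration only
y_true = [6.0, 7.5, 5.0, 8.0]
y_pred = [6.5, 7.0, 5.5, 7.0]

mse = mean_squared_error(y_true, y_pred)  # in squared pIC50 units
rmse = math.sqrt(mse)                     # back in pIC50 units

print("MSE = %.4f, RMSE = %.4f" % (mse, rmse))
```

The squared errors here are 0.25, 0.25, 0.25 and 1.0, so the MSE is 0.4375 and the RMSE is roughly 0.66 pIC50 units.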

In [8]:

# define the first MLP regressor model
def MLP_model1(sample_num, feat_num):
	# create model
	model = Sequential()
	model.add(Dense(sample_num, input_dim=feat_num, kernel_initializer='normal', activation='relu'))
	model.add(Dense(1, kernel_initializer='normal'))
	# Compile model
	model.compile(loss='mean_squared_error', optimizer='adam')
	return model


The Keras wrapper object for use in scikit-learn as a regression estimator is called KerasRegressor. We create an instance and pass it both the name of the function to create the neural network model as well as some parameters to pass along to the fit() function of the model later, such as the number of epochs and batch size.
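The pattern described above (a wrapper that stores a model-building function together with extra parameters, and later routes them to the builder or to `fit()`) can be sketched in plain Python. Everything in this block, including `TinyRegressorWrapper`, `MeanModel` and `build_mean_model`, is a hypothetical stand-in for illustration; it is not the actual `KerasRegressor` implementation.

```python
import inspect

class TinyRegressorWrapper:
    """Hypothetical, simplified wrapper: stores a build function and parameters."""

    def __init__(self, build_fn, **params):
        self.build_fn = build_fn
        self.params = params

    def _split_params(self):
        # Parameters named in the build function's signature go to the builder;
        # everything else is passed along to the model's fit() method.
        build_names = set(inspect.signature(self.build_fn).parameters)
        build_args = {k: v for k, v in self.params.items() if k in build_names}
        fit_args = {k: v for k, v in self.params.items() if k not in build_names}
        return build_args, fit_args

    def fit(self, X, y):
        build_args, fit_args = self._split_params()
        self.model_ = self.build_fn(**build_args)  # a fresh model on every fit
        self.model_.fit(X, y, **fit_args)
        return self


class MeanModel:
    """Toy model that always predicts the mean of the training targets."""

    def __init__(self, sample_num, feat_num):
        self.shape = (sample_num, feat_num)

    def fit(self, X, y, epochs=1, batch_size=1):
        self.fit_settings = {"epochs": epochs, "batch_size": batch_size}
        self.mean_ = sum(y) / len(y)

    def predict(self, X):
        return [self.mean_ for _ in X]


def build_mean_model(sample_num, feat_num):
    return MeanModel(sample_num, feat_num)


est = TinyRegressorWrapper(build_mean_model, sample_num=3, feat_num=2,
                           epochs=10, batch_size=2)
est.fit([[0, 1], [1, 0], [1, 1]], [5.0, 6.0, 7.0])
print(est.model_.shape, est.model_.fit_settings, est.model_.predict([[0, 0]]))
```

Building a fresh model inside `fit()` is what lets cross-validation train an independent model on each fold.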

We also initialize the random number generator with a constant random seed, a process we will repeat for each model evaluated in this tutorial. This is an attempt to ensure we compare models consistently.
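The effect of a fixed seed can be illustrated with Python's standard `random` module: the same seed reproduces the same sequence of draws, while a different seed does not. The `draw_weights` helper is a hypothetical name used only for this sketch. Note that Keras backends such as TensorFlow keep their own random number generators, so seeding NumPy alone may not make training fully reproducible.

```python
import random

def draw_weights(seed, n=3):
    """Draw n pseudo-random normal values from a generator seeded with `seed`."""
    rng = random.Random(seed)
    return [rng.gauss(0, 1) for _ in range(n)]

first = draw_weights(2019)
second = draw_weights(2019)
other = draw_weights(2020)

print(first == second)  # same seed, same sequence
print(first == other)   # different seed, different sequence
```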

In [9]:
# wrap the Keras model as a scikit-learn regressor (the fingerprints are used without standardization)
estimator = KerasRegressor(build_fn=MLP_model1, sample_num=mol_num, feat_num=feat_num, epochs=10, batch_size=2, verbose=0)

The final step is to evaluate this baseline model. We will use 10-fold cross validation to evaluate the model.
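As a reminder of what k-fold cross-validation does, the sketch below splits sample indices into k folds and uses each fold once as the test set while the remaining folds form the training set. The `kfold_indices` helper is a hypothetical, unshuffled illustration written for this tutorial, not scikit-learn's `KFold` class, and the sample count is made up.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, test_idx) pairs for contiguous, unshuffled folds."""
    # Spread the remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // n_splits] * n_splits
    for i in range(n_samples % n_splits):
        fold_sizes[i] += 1
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

# Toy example: 25 samples split into 10 folds
for fold, (train_idx, test_idx) in enumerate(kfold_indices(n_samples=25, n_splits=10)):
    print(fold, len(train_idx), test_idx)
```

Each sample appears in exactly one test fold, so every compound contributes once to the evaluation.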

In [14]:
from scipy.stats import kendalltau
from sklearn.model_selection import cross_validate

def kendalls_tau(estimator, X, y):
    # custom scorer: Kendall's rank correlation between predicted and observed values
    preds = np.ravel(estimator.predict(X))
    return kendalltau(preds, y)[0]

# cross_val_score accepts only a single metric; cross_validate accepts a dict of scorers
scorer = {'r2': 'r2', 'MSE': 'neg_mean_squared_error', 'tau': kendalls_tau}

# shuffle=True is required for random_state to have an effect
kfold = KFold(n_splits=10, shuffle=True, random_state=seed)
results = cross_validate(estimator, x, np.asarray(y), scoring=scorer, cv=kfold)
# scikit-learn negates the MSE so that larger is better; flip the sign back for reporting
mse = -results['test_MSE']
print("Results: %.2f (%.2f) MSE" % (mse.mean(), mse.std()))

Running this code gives us an estimate of the model’s performance on unseen data. The result reports the mean squared error as its average and standard deviation across all 10 folds of the cross-validation.

