<a href="https://colab.research.google.com/github/MassimilianoBiancucci/Tensorflow-exercises/blob/master/MLPs/Comprasion_between_MLPs_and_random_forest_using_tox21_dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Training an MLP network to recognize whether a molecule is toxic to humans


We start by setting up our VM with all the packages needed to run our code.

In [0]:
!pip install deepchem
!pip install simdna
!pip install nosetests
!wget -c https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
!chmod +x Miniconda3-latest-Linux-x86_64.sh
!time bash ./Miniconda3-latest-Linux-x86_64.sh -b -f -p /usr/local
!time conda install -q -y -c conda-forge rdkit

In this section we download the packages needed to run TensorBoard remotely on Colab and get a link from ngrok.com to reach TensorBoard through Colab's firewall.
![alt text](https://gitcdn.xyz/cdn/Tony607/blog_statics/d425c3fe4cf0d92067572e25ae6cc3198d51936b//images/ngrok/ngrok.jpg)

In [0]:
!wget https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip
!unzip ngrok-stable-linux-amd64.zip

Here we launch TensorBoard in the background and set the directory where the session logs are saved.


In [0]:
LOG_DIR = './log'
get_ipython().system_raw(
    'tensorboard --logdir {} --host 0.0.0.0 --port 6006 &'
    .format(LOG_DIR)
)

Then, we can run ngrok to tunnel TensorBoard port 6006 to the outside world. This command also runs in the background.


In [0]:
get_ipython().system_raw('./ngrok http 6006 &')

Now we get the public URL where we can access the Colab TensorBoard.
Keep in mind that training has to start before you can see anything there.


In [0]:
! curl -s http://localhost:4040/api/tunnels | python3 -c \
    "import sys, json; print(json.load(sys.stdin)['tunnels'][0]['public_url'])"

**Import libraries:**

We start by importing **numpy**, a Python library for working efficiently with arrays, followed by **tensorflow** and **sklearn.metrics**, which provide the tools for working with data and networks and, in this case, for measuring the performance of the model.
We also import **Matplotlib** to display data and, finally, **deepchem**, which contains the Tox21 dataset.

**Tox21** is a unique collaboration between several federal agencies to develop new ways to rapidly test whether substances adversely affect human health. This dataset consists of a set of 10,000 molecules, each represented as a vector and tested for interaction with the androgen receptor.


In [0]:
import matplotlib.pyplot as plt
import sys
import os
sys.path.append('/usr/local/lib/python3.7/site-packages/')
import numpy as np
np.random.seed(456)
import tensorflow as tf
tf.set_random_seed(456)
import deepchem.molnet as dc
from sklearn.metrics import accuracy_score

Now we load the Tox21 dataset and prepare it by dropping the data we do not need.

In [0]:
_, (train, valid, test), _ = dc.load_tox21()
train_X, train_y, train_w = train.X, train.y, train.w
valid_X, valid_y, valid_w = valid.X, valid.y, valid.w
test_X, test_y, test_w = test.X, test.y, test.w


# Remove extra tasks
train_y = train_y[:, 0]
valid_y = valid_y[:, 0]
test_y = test_y[:, 0]
train_w = train_w[:, 0]
valid_w = valid_w[:, 0]
test_w = test_w[:, 0]
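
The slicing above keeps only column 0 of the label and weight matrices, so each molecule ends up with a single label and a single weight. A small NumPy sketch of that step, using made-up arrays (the shapes and values are illustrative, not taken from Tox21):

```python
import numpy as np

# Hypothetical multi-task labels: 4 molecules x 3 tasks.
y = np.array([[0, 1, 0],
              [1, 0, 0],
              [0, 0, 1],
              [0, 1, 1]])
# Hypothetical per-example, per-task weights with the same shape.
w = np.ones_like(y, dtype=float)

# Keep only the first task: one label and one weight per molecule.
y_first = y[:, 0]
w_first = w[:, 0]

print(y.shape, "->", y_first.shape)
print(y_first)
```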


In this section we declare the structure of the TensorFlow graph that represents the model.

In [0]:
# Generate tensorflow graph
#general parameters
d = 1024
n_hidden = 50
n_hidden2 = 30
n_hidden3 = 10
learning_rate = .01
n_epochs = 20
batch_size = 100

#dataset
with tf.name_scope("dataset"):
  x = tf.placeholder(tf.float32, (None, d))
  y = tf.placeholder(tf.float32, (None,))
  
  
with tf.name_scope("hidden-layer1"):
  W = tf.Variable(tf.random_normal((d, n_hidden)))
  b = tf.Variable(tf.random_normal((n_hidden,)))
  x_hidden = tf.nn.relu(tf.matmul(x, W) + b)
  
with tf.name_scope("hidden-layer2"):
  W = tf.Variable(tf.random_normal((n_hidden, n_hidden2)))
  b = tf.Variable(tf.random_normal((n_hidden2,)))
  x_hidden2 = tf.nn.relu(tf.matmul(x_hidden, W) + b)
  
with tf.name_scope("hidden-layer3"):
  W = tf.Variable(tf.random_normal((n_hidden2, n_hidden3)))
  b = tf.Variable(tf.random_normal((n_hidden3,)))
  x_hidden3 = tf.nn.relu(tf.matmul(x_hidden2, W) + b)
  
with tf.name_scope("output"):
  W = tf.Variable(tf.random_normal((n_hidden3, 1)))
  b = tf.Variable(tf.random_normal((1,)))
  y_logit = tf.matmul(x_hidden3, W) + b
  
  # the sigmoid gives the class probability of 1
  y_one_prob = tf.sigmoid(y_logit)
  # Rounding P(y=1) will give the correct prediction.
  y_pred = tf.round(y_one_prob)
  
#setting of loss function
with tf.name_scope("loss"):
  
  # Compute the cross-entropy term for each datapoint
  y_expand = tf.expand_dims(y, 1)
  entropy = tf.nn.sigmoid_cross_entropy_with_logits(logits=y_logit, labels=y_expand)
  
  # Sum all contributions
  l = tf.reduce_sum(entropy)

#setting of optimization algorithm
with tf.name_scope("optim"):
  train_op = tf.train.AdamOptimizer(learning_rate).minimize(l)

#setting variables to show in tensorboard scalar section 
with tf.name_scope("summaries"):
  tf.summary.scalar("loss", l)
  merged = tf.summary.merge_all()

#configure folder for tensorboard data
train_writer = tf.summary.FileWriter(LOG_DIR, tf.get_default_graph())
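
To make the graph above more concrete, here is a NumPy sketch of the same computation for a single batch: three ReLU hidden layers, one output logit, a sigmoid, rounding, and the summed sigmoid cross-entropy. The layer sizes are shrunk and the inputs are random, so the numbers are only illustrative. The helper names (`init_layer`, `forward`, `sigmoid_cross_entropy`) are hypothetical, and this is a sketch of the arithmetic, not the TensorFlow implementation itself.

```python
import numpy as np

rng = np.random.default_rng(456)

def init_layer(n_in, n_out):
    """Standard-normal weights and biases, like tf.random_normal in the graph."""
    return rng.normal(size=(n_in, n_out)), rng.normal(size=(n_out,))

def forward(x, layers):
    """Apply the ReLU hidden layers, then a linear output layer returning logits."""
    h = x
    for W, b in layers[:-1]:
        h = np.maximum(h @ W + b, 0.0)  # ReLU
    W, b = layers[-1]
    return h @ W + b  # shape (batch, 1)

def sigmoid_cross_entropy(logits, labels):
    """Numerically stable form: max(x, 0) - x*z + log(1 + exp(-|x|))."""
    return (np.maximum(logits, 0) - logits * labels
            + np.log1p(np.exp(-np.abs(logits))))

# Small illustrative sizes (the notebook uses 1024 -> 50 -> 30 -> 10 -> 1).
sizes = [8, 5, 4, 3, 1]
layers = [init_layer(a, b) for a, b in zip(sizes[:-1], sizes[1:])]

x = rng.normal(size=(6, sizes[0]))             # 6 fake molecules
y = np.array([0, 1, 0, 0, 1, 0], dtype=float)  # fake labels

logits = forward(x, layers)
prob_one = 1.0 / (1.0 + np.exp(-logits))       # P(y = 1)
pred = np.round(prob_one)                      # hard 0/1 prediction
loss = sigmoid_cross_entropy(logits, y[:, None]).sum()

print(logits.shape, pred.ravel(), loss)
```

Note that the labels are expanded to shape `(batch, 1)` before the loss is computed, for the same reason the graph uses `tf.expand_dims`: the logits have a trailing dimension of size 1.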

In this section the model is trained.

In [0]:
N = train_X.shape[0]
with tf.Session() as sess:
  sess.run(tf.global_variables_initializer())
  step = 0
  for epoch in range(n_epochs):
    pos = 0
    while pos < N:
      batch_X = train_X[pos:pos+batch_size]
      batch_y = train_y[pos:pos+batch_size]
      feed_dict = {x: batch_X, y: batch_y}
      _, summary, loss = sess.run([train_op, merged, l], feed_dict=feed_dict)
      print("epoch %d, step %d, loss: %f" % (epoch, step, loss))
      train_writer.add_summary(summary, step)
    
      step += 1
      pos += batch_size

  # Make predictions for model evaluation
  # for the training dataset
  train_y_pred = sess.run(y_pred, feed_dict={x: train_X})
  #for the validation dataset
  valid_y_pred = sess.run(y_pred, feed_dict={x: valid_X})
  #for the test dataset
  test_y_pred = sess.run(y_pred, feed_dict={x: test_X})
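
The inner `while` loop walks through the training set in consecutive slices of `batch_size` rows; the last slice is simply shorter when `N` is not a multiple of `batch_size`. A minimal sketch of that indexing, with a hypothetical helper `batch_bounds` and a made-up number of examples:

```python
def batch_bounds(n_examples, batch_size):
    """Yield (start, stop) index pairs covering range(n_examples) in order."""
    pos = 0
    while pos < n_examples:
        # Slicing past the end is safe for arrays; we clamp only for display.
        yield pos, min(pos + batch_size, n_examples)
        pos += batch_size

# Illustrative numbers only: 250 examples in batches of 100.
bounds = list(batch_bounds(250, 100))
print(bounds)  # [(0, 100), (100, 200), (200, 250)]
```

Because the loop never shuffles the data, every epoch sees the same batches in the same order.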

In this section we use **train_y_pred**, **valid_y_pred**, and **test_y_pred** to calculate two scores for each dataset with the sklearn function **accuracy_score**. This function takes an optional argument, **sample_weight**, which by default gives every example a weight of 1. The first score uses that same weight for all examples; the second uses the weight array provided with the Tox21 dataset.


---


**Set a correct metric**

Tox21 provides this array because we want the model to learn to recognize toxic substances. In the dataset these substances are only 5% of the total, so a fair metric has to balance this inequality. To do this we use the weight array mentioned above, which contains a weight of 19 for toxic substances and 1 for non-toxic substances. In this way the toxic examples account for 50% of the total weight and the non-toxic examples for the other 50%, which gives a balanced metric for our purposes.


In [0]:
#calculation of normal score for the train dataset
train_score = accuracy_score(train_y, train_y_pred)
print("Unweighted Classification Accuracy for the training dataset: %f" % train_score)
#calculation of weighted score for the train dataset
weighted_train_score = accuracy_score(train_y, train_y_pred, sample_weight=train_w)
print("Weighted Classification Accuracy for the training dataset: %f" % weighted_train_score)

#calculation of normal score for the validation dataset
valid_score = accuracy_score(valid_y, valid_y_pred)
print("Unweighted Classification Accuracy for the validation dataset: %f" % valid_score)
#calculation of weighted score for the validation dataset
weighted_valid_score = accuracy_score(valid_y, valid_y_pred, sample_weight=valid_w)
print("Weighted Classification Accuracy for the validation dataset: %f" % weighted_valid_score)

#calculation of normal score for the test dataset
test_score = accuracy_score(test_y, test_y_pred)
print("Unweighted Classification Accuracy for the test dataset: %f" % test_score)
#calculation of weighted score for the test dataset
weighted_test_score = accuracy_score(test_y, test_y_pred, sample_weight=test_w)
print("Weighted Classification Accuracy for the test dataset: %f" % weighted_test_score)
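
To see why the weighted score matters, consider a made-up set of 100 labels with 5 toxic examples and a model that always predicts non-toxic. The sketch below computes weighted accuracy by hand with NumPy, as the total weight of the correct predictions divided by the total weight, which is the quantity `accuracy_score` returns when `sample_weight` is given. The 19-to-1 weights follow the explanation above; the data and the helper name `weighted_accuracy` are illustrative.

```python
import numpy as np

def weighted_accuracy(y_true, y_pred, weights=None):
    """Share of the total weight that falls on correctly predicted examples."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    if weights is None:
        weights = np.ones(len(y_true))
    correct = (y_true == y_pred).astype(float)
    return float(np.sum(correct * weights) / np.sum(weights))

# Made-up data: 5 toxic (1) and 95 non-toxic (0) examples.
y_true = np.array([1] * 5 + [0] * 95)
# A trivial model that always predicts "non-toxic".
y_pred = np.zeros(100, dtype=int)
# Weight 19 for toxic examples, 1 for non-toxic ones.
weights = np.where(y_true == 1, 19.0, 1.0)

print(weighted_accuracy(y_true, y_pred))           # 0.95
print(weighted_accuracy(y_true, y_pred, weights))  # 0.5
```

The trivial model looks good on the unweighted score but drops to chance level once both classes carry the same total weight.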

# Training a Random Forest to recognize whether a molecule is toxic to humans

Now we import from the sklearn library a useful class for training a random forest classifier.

In [0]:
from sklearn.ensemble import RandomForestClassifier

With this simple class it is easy to create a standard machine learning model and train it with the simple method **fit**. Note that the two arguments of this method are the input data first and then the labels; the model makes all the necessary adjustments.

In [0]:
sklearnModel = RandomForestClassifier(class_weight="balanced", n_estimators=100)
sklearnModel.fit(train_X, train_y)
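
For `class_weight="balanced"`, scikit-learn documents the class weights as `n_samples / (n_classes * np.bincount(y))`. A sketch of that formula on made-up labels with the 5% toxic rate discussed earlier (the data are illustrative, not Tox21):

```python
import numpy as np

# Made-up labels: 5 toxic (1) and 95 non-toxic (0) examples.
y = np.array([1] * 5 + [0] * 95)

n_samples = len(y)
counts = np.bincount(y)      # examples per class: [95, 5]
n_classes = len(counts)

# One weight per class; rarer classes get larger weights.
class_weights = n_samples / (n_classes * counts)

print(class_weights)
print(class_weights[1] / class_weights[0])
```

On this toy data the toxic class ends up weighted about 19 times more than the non-toxic class, the same ratio used for the weighted metric.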

After training the model we can get its predictions for the entire training, validation, and test datasets with the simple command **predict** and store them in the same variables as before.

In [0]:
train_y_pred = sklearnModel.predict(train_X)
valid_y_pred = sklearnModel.predict(valid_X)
test_y_pred = sklearnModel.predict(test_X)

Now we do the same thing as in the previous example to balance the metric for the random forest classifier as well.

In [0]:
#calculation of weighted score for the train dataset
weighted_score = accuracy_score(train_y, train_y_pred, sample_weight=train_w)
print("Weighted train Classification Accuracy: %f" % weighted_score)
#calculation of weighted score for the validation dataset
weighted_score = accuracy_score(valid_y, valid_y_pred, sample_weight=valid_w)
print("Weighted validation Classification Accuracy: %f" % weighted_score)
#calculation of weighted score for the test dataset
weighted_score = accuracy_score(test_y, test_y_pred, sample_weight=test_w)
print("Weighted test Classification Accuracy: %f" % weighted_score)