# Lab 10. Neural networks

#### Table of contents

1. Overview
2. The QM7 dataset
3. Prepare the data
4. Implementation
5. Learning rate
6. Regularization


## 1. Overview

In this lab session we will improve the code we developed in Lab 9 to optimize a neural network. We will explore the learning rate and regularization hyperparameters.

## 2. The QM7 dataset

This dataset (QM7b) is an extension of the QM7 dataset for multitask learning, where 13 additional properties (e.g. polarizability, HOMO and LUMO eigenvalues, excitation energies) have to be predicted at different levels of theory (ZINDO, SCS, PBE0, GW). Additional molecules containing chlorine atoms are also included, for a total of 7211 molecules.
The dataset is composed of two multidimensional arrays, $X$ ($7211\times 23\times 23$) and $T$ ($7211\times 14$), representing the inputs (Coulomb matrices) and the labels (molecular properties), and one array `names` of size 14 listing the names of the different properties.
More details are provided in this [paper](https://iopscience.iop.org/article/10.1088/1367-2630/15/9/095003/meta).

In short, the dataset contains features that describe small molecules (these features are called Coulomb matrices) and 14 molecular properties, as follows:

1. Atomization energies (PBE0, unit: kcal/mol)
2. Excitation of maximal optimal absorption (ZINDO, unit: eV)
3. Absorption Intensity at maximal absorption (ZINDO)
4. Highest occupied molecular orbital HOMO (ZINDO, unit: eV)
5. Lowest unoccupied molecular orbital LUMO (ZINDO, unit: eV)
6. First excitation energy (ZINDO, unit: eV)
7. Ionization potential IP (ZINDO, unit: eV)
8. Electron affinity EA (ZINDO, unit: eV)
9. Highest occupied molecular orbital HOMO (PBE0, unit: eV)
10. Lowest unoccupied molecular orbital LUMO (PBE0, unit: eV)
11. Highest occupied molecular orbital HOMO (GW, unit: eV)
12. Lowest unoccupied molecular orbital LUMO (GW, unit: eV)
13. Polarizabilities (PBE0, unit: Å$^3$)
14. Polarizabilities (SCS, unit: Å$^3$)

Because these properties are complicated to compute, machine learning methods can be trained to predict them from meaningful features. Coulomb matrices are one such representation.

A Coulomb matrix is defined from the atomic positions $R_I$ and atomic charges $Z_I$ of the atoms in a molecule as:

$M_{IJ}=\left\{
\begin{array}{ll}
0.5Z_I^{2.4}\text{ for }I=J\\
\frac{Z_IZ_J}{|R_I-R_J|}\text{ for }I\neq J\\
\end{array}
\right.
$

Here, the Coulomb matrices are already computed and provided in the training set.
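
For illustration only, the definition above can be sketched in a few lines of NumPy. The function name `coulomb_matrix` and the H2 geometry (bond length of 1.4 in atomic units) are assumptions made for this example; they are not part of the dataset or of the lab code.

```python
import numpy as np

def coulomb_matrix(Z, R):
    """Build the Coulomb matrix from charges Z (n,) and positions R (n, 3)."""
    Z = np.asarray(Z, dtype=float)
    R = np.asarray(R, dtype=float)
    # Pairwise distances |R_I - R_J|
    dist = np.linalg.norm(R[:, None, :] - R[None, :, :], axis=-1)
    # Avoid dividing by zero on the diagonal; those entries are overwritten below
    np.fill_diagonal(dist, 1.0)
    M = np.outer(Z, Z) / dist
    # Diagonal terms: 0.5 * Z_I^2.4
    np.fill_diagonal(M, 0.5 * Z**2.4)
    return M

# Hypothetical example: two hydrogen atoms separated by 1.4 (atomic units)
M_h2 = coulomb_matrix([1, 1], [[0.0, 0.0, 0.0], [0.0, 0.0, 1.4]])
print(M_h2)
```

The matrix is symmetric by construction, since the off-diagonal term depends only on the product of the charges and the distance between the two atoms.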

## 3. Prepare the data

Let's first load the data and reshape it into 2D arrays (this was explained in Lab 8). 

In [1]:
from scipy.io import loadmat
qm7 = loadmat('qm7b.mat')

Below we set the input and output variables. To speed up the neural network optimization, we will select only the first 1500 examples in the dataset. Note that changing the size of the dataset can have dramatic effects on the results and parameters discussed below, so keep this number when answering all questions.

In [2]:
import numpy as np

xsize = 1500
X0 = qm7['X']
X = X0.reshape(7211,529)
X = np.c_[X[:xsize]]
print(X.shape)

(1500, 529)
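
As a small aside, the reshape used above flattens each $23\times 23$ matrix row by row into a vector of 529 features. The sketch below shows the same operation on a toy array; the names `X_demo` and `X_flat` and the $3\times 3$ size are assumptions chosen to keep the example readable.

```python
import numpy as np

# Toy stand-in for the input array: 2 "molecules", each a 3x3 matrix
X_demo = np.arange(2 * 3 * 3).reshape(2, 3, 3)

# Keep the first axis (examples) and flatten the remaining axes into features
X_flat = X_demo.reshape(X_demo.shape[0], -1)

print(X_flat.shape)
print(X_flat[1])
```

Using `-1` lets NumPy infer the number of features, so the same line works for any matrix size.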


In [3]:
# Property 1 (atomization energy), converted from kcal/mol to eV (1 kcal/mol is about 0.043 eV)
y = qm7['T'][:,0]*0.043
y = np.c_[y[:xsize]]
print(y.shape)

(1500, 1)


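We split the dataset into 80% training, 10% validation and 10% testing. Then, we pass the data through a scaler that is fitted on the training data only and then applied to the validation data, so that no information from the validation set enters the preprocessing. Note that with `with_mean=False` and `with_std=False`, as set below, `StandardScaler` leaves the data unchanged; the default arguments would center each feature and scale it to unit variance.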

In [4]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.2, random_state=31)
X_test, X_val, y_test, y_val = train_test_split(X_test,y_test,test_size = 0.5,random_state=32)
print(X_train.shape,X_test.shape,X_val.shape)

X_scaler = StandardScaler(with_mean=False,with_std=False).fit(X_train)
X_train = X_scaler.transform(X_train)
X_val = X_scaler.transform(X_val)

(1200, 529) (150, 529) (150, 529)
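
To make the fit-on-train, transform-on-validation pattern concrete, here is a minimal sketch on synthetic data. It uses `StandardScaler` with its default arguments (centering and scaling enabled), unlike the cell above, and the `X_toy_*` arrays are made up for the example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the training and validation features
rng = np.random.RandomState(0)
X_toy_train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X_toy_val = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

# The mean and standard deviation are estimated on the training data only
scaler = StandardScaler().fit(X_toy_train)
X_toy_train_std = scaler.transform(X_toy_train)
# The validation data is transformed with the training statistics
X_toy_val_std = scaler.transform(X_toy_val)

print(X_toy_train_std.mean(axis=0))  # close to 0 for every feature
print(X_toy_train_std.std(axis=0))   # close to 1 for every feature
```

The transformed validation features will generally not have exactly zero mean or unit variance, which is expected: they are scaled with the statistics of the training set.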


## 4. Implementation

__Q.1.__ Implement l1 regularization in the code block below. You should look at the regular implementation (without regularization) in the code of Lab 9 and modify it to include regularization. You must only assign values to the variable `self.weights` (3 marks).

__Q.2.__ Similarly, implement in the code block below l2 regularization (3 marks).

In [5]:
import random
import numpy as np
from numpy.random import RandomState
import time
from sklearn.utils import shuffle

class Network(object):

    def __init__(self, sizes, reg='l1'):
        prng = RandomState(33) # seed for random numbers
        self.num_layers = len(sizes)        
        self.sizes = sizes
        self.reg = reg # variable for regularization technique
        self.biases = [prng.randn(y, 1) for y in sizes[1:]]
        self.weights = [prng.randn(y, x)/np.sqrt(x) for x, y in zip(sizes[:-1], sizes[1:])]
                
    def feedforward(self, a):  
        a_list = [a]
        z_list = []
        for b, w in zip(self.biases[:-1], self.weights[:-1]):
            z = np.dot(w, a)+b
            z_list.append(z)
            a = tanh(z)
            a_list.append(a)
        z = np.dot(self.weights[-1], a)+self.biases[-1]
        z_list.append(z)
        a_list.append(z)
        return a_list,z_list

    def SGD(self, X, y, X_test, y_test, hyper_params):
        # We get the hyper-parameters
        epochs, mini_batch_size, alpha, lmbda = hyper_params
        rmse, y_pred = self.evaluate(X,y)
        rmse_test, y_pred_test = self.evaluate(X_test,y_test)
        print("Epoch {:3d} complete Train {:.4f} eV Test {:.4f} eV".format(0,rmse,rmse_test))
        m,n = X.shape
        rmse_list, rmse_test_list = [],[]
        # Loop over epochs
        for j in range(epochs):
            t0 = time.time()
            total_batch = int(m/mini_batch_size)
            # Loop over batches
            for k in range(total_batch):
                offset = k*mini_batch_size
                Xi = X[offset:offset+mini_batch_size]
                Yi = y[offset:offset+mini_batch_size]
                # Update weights and biases
                self.update_mini_batch(Xi,Yi,alpha,lmbda,m)
            if (j+1) % 1 == 0:
                rmse, y_pred = self.evaluate(X,y)
                rmse_list.append(rmse)
                t = time.time()
                rmse_test, y_pred_test = self.evaluate(X_test,y_test)
                rmse_test_list.append(rmse_test)
                print("Epoch {:3d} complete Train {:.4f} eV Test {:.4f} eV @{:.3f}s".format(j+1,rmse,rmse_test,t-t0))
            else: 
                t = time.time()
                print("Epoch {:3d} complete @{:.3f}s".format(j+1,t-t0))
        return rmse_list, rmse_test_list, y_pred, y_pred_test

    def update_mini_batch(self, Xi, Yi, alpha, lmbda, m):
        # Create arrays filled with zeros
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        mi,ni = Xi.shape
        # Loop over examples in the mini batch
        for i in range(mi):
            # Backprop
            delta_nabla_b, delta_nabla_w = self.backprop(np.c_[Xi[i]], Yi[i])
            nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        # Update weights and biases via GD based on all examples in the mini batch
        if self.reg == 'l1':
            ### Q.1. BEGIN SOLUTION
            ### Q.1. END SOLUTION
        if self.reg == 'l2':
            ### Q.2. BEGIN SOLUTION
            ### Q.2. END SOLUTION
        self.biases = [b-(alpha/mi)*nb for b, nb in zip(self.biases, nabla_b)]

    def backprop(self, xi, yi):
        # Initialize arrays
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        # Forward path
        a_list,z_list = self.feedforward(xi)
        delta = self.J_prime(a_list[-1], yi) # BP1
        nabla_b[-1] = delta # BP3
        nabla_w[-1] = np.dot(delta, a_list[-2].transpose()) # BP4
        # Backpropagate
        for l in range(2, self.num_layers):
            z = z_list[-l]
            sp = tanh_prime(z)
            delta = np.dot(self.weights[-l+1].transpose(), delta) * sp
            # Compute dJ/db of each layer (BP3)
            nabla_b[-l] = delta
            # Compute dJ/dw of each layer (BP4)
            nabla_w[-l] = np.dot(delta, a_list[-l-1].transpose())
        return nabla_b, nabla_w
    
    def evaluate(self, X_val, y_val):
        # Feedforward and compute RMSE
        m_val,n_val = X_val.shape
        a_list,z_list = self.feedforward(np.c_[X_val].T)
        rmse = np.sqrt(np.sum((a_list[-1]-np.c_[y_val].T)**2)/m_val)
        # Returns RMSE value and list of output values over arguments data X_val/y_val
        return rmse,a_list[-1]
    
    def J_prime(self, hi, y):
        return hi-y

# Some activation functions and their respective derivatives
def sigmoid(z):
    return 1.0/(1.0+np.exp(-z))

def tanh(z):
    return 2*sigmoid(2*z)-1

def sigmoid_prime(z):
    return sigmoid(z)*(1-sigmoid(z))

def tanh_prime(z):
    return 1-tanh(z)**2
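
As background for Q.1 and Q.2, the sketch below shows one common form of the l1- and l2-regularized gradient step on a toy weight matrix. It is not the reference solution: the function `regularized_step` and its arguments are hypothetical, and it assumes the penalty is scaled by the number of training examples (`n_train`), a convention that may differ from the one expected in this lab.

```python
import numpy as np

def regularized_step(w, grad_sum, alpha, lmbda, batch_size, n_train, reg=None):
    """Return the weights after one mini-batch step (hypothetical helper).

    grad_sum is the gradient summed over the examples of the mini-batch.
    """
    data_step = (alpha / batch_size) * grad_sum
    if reg == 'l2':
        # Penalty (lmbda / (2 n)) * sum(w^2): shrinks each weight proportionally
        return (1.0 - alpha * lmbda / n_train) * w - data_step
    if reg == 'l1':
        # Penalty (lmbda / n) * sum(|w|): moves each weight towards zero by a constant
        return w - (alpha * lmbda / n_train) * np.sign(w) - data_step
    # No regularization: plain gradient descent step
    return w - data_step

# Toy weights with a zero gradient, so only the effect of the penalty is visible
w_demo = np.array([[0.5, -0.2], [0.0, 1.0]])
g_demo = np.zeros_like(w_demo)
w_l2 = regularized_step(w_demo, g_demo, alpha=0.1, lmbda=1.0, batch_size=10, n_train=10, reg='l2')
w_l1 = regularized_step(w_demo, g_demo, alpha=0.1, lmbda=1.0, batch_size=10, n_train=10, reg='l1')
print(w_l2)
print(w_l1)
```

With these settings, l2 multiplies every weight by the same factor, whereas l1 subtracts the same amount from the magnitude of every non-zero weight, which tends to drive small weights to exactly zero.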

## 5. Learning rate

We would first like to find a good learning rate for the data we have. We will use a feedforward neural network with 2 hidden layers of 100 neurons each. At first, we will not regularize.

__Q.3.__ Define the list `alphas` of learning rates with values 0.01, 0.001, 0.0001 and 0.00001, and perform gradient descent for 40 epochs with a mini-batch size of 10 and without regularization (define the variables `epochs`, `mini_batch_size` and `lmbda`). Once you have defined the variables, you can execute the code block to perform GD over the various learning rates (1 mark).

In [None]:
### BEGIN SOLUTION
### END SOLUTION

err, err_val  = [],[]
for alpha in alphas:
    print('>>> Alpha =',alpha)
    net = Network([X_train.shape[1],100,100,1])
    hyper_params = epochs, mini_batch_size, alpha,lmbda
    rmse_list,rmse_list_test,y_predict, y_predict_test = net.SGD(X_train,y_train,X_val,y_val,hyper_params) 
    err.append(rmse_list)
    err_val.append(rmse_list_test)

To select the optimal learning rate, we can plot the RMSE of the training and validation datasets for the various learning rates explored. For better selection, we print the averaged values of the RMSE over the last 10 epochs (10-RMSE).

In [None]:
%matplotlib inline
import matplotlib
import numpy as np
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 5))
ax1 = plt.subplot(121)
ax2 = plt.subplot(122)
for i in range(len(err)):
    ax1.plot(err[i],marker = '.',ms=0,ls='-',lw=4,label='alpha '+str(alphas[i]))
    ax2.plot(err_val[i],marker = '.',ms=0,ls='-',lw=4,label='alpha '+str(alphas[i]))
    print('>>> Alpha=',alphas[i],' 10-RMSE=',np.mean(err[i][-10:]),np.mean(err_val[i][-10:]),'eV')

ax1.set_ylim(1,10)
ax2.set_ylim(4,10)
ax1.set_xlabel('epoch',fontsize=22)
ax2.set_xlabel('epoch',fontsize=22)
ax1.set_ylabel('RMSE (eV)',fontsize=22)
ax1.set_title('Training set',fontsize=22)
ax2.set_title('Validation set',fontsize=22)
plt.legend(ncol=2)
plt.tight_layout()
#plt.savefig('figure.pdf')
plt.show()

Based on the previous plots, we can select as optimal the learning rate that leads to the lowest validation error (10-RMSE).
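
The selection rule can also be written as a short computation. The sketch below uses made-up validation curves (`err_val_demo`) and a hypothetical helper `last_k_mean`; the numbers are not results from this lab.

```python
import numpy as np

# Hypothetical validation RMSE curves, one list per learning rate (not real results)
alphas_demo = [0.01, 0.001]
err_val_demo = [
    [9.0, 8.0] + [7.0] * 10,  # plateaus at 7.0
    [9.5, 7.5] + [5.0] * 10,  # plateaus at 5.0
]

def last_k_mean(curve, k=10):
    """Average of the last k values of an RMSE curve."""
    return float(np.mean(curve[-k:]))

# 10-RMSE for each learning rate, then pick the learning rate with the smallest one
scores = [last_k_mean(curve) for curve in err_val_demo]
best_alpha = alphas_demo[int(np.argmin(scores))]
print(scores, best_alpha)
```

Averaging over the last epochs makes the comparison less sensitive to epoch-to-epoch fluctuations than reading off the final value alone.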

## 6. Regularization

We will now select the optimal regularization parameter while keeping the learning rate to the value previously selected.

__Q.4.__ Define the list `lmbdas` of regularization parameters with values 0, 10, 1, 0.1 and 0.01, and perform gradient descent for 40 epochs with a mini-batch size of 10 and the learning rate selected previously (define the variables `epochs`, `mini_batch_size` and `alpha`). We will only consider l1 regularization (1 mark).

In [None]:
### BEGIN SOLUTION
### END SOLUTION

err, err_val  = [],[]
for lmbda in lmbdas:
    print('>>> Lambda =',lmbda)
    net = Network([X_train.shape[1],100,100,1])
    hyper_params = epochs, mini_batch_size, alpha,lmbda
    rmse_list,rmse_list_test,y_predict, y_predict_test = net.SGD(X_train,y_train,X_val,y_val,hyper_params) 
    err.append(rmse_list)
    err_val.append(rmse_list_test)

To select the optimal regularization parameter, we can plot the RMSE on the training and validation data. For better selection, we print the averaged values of the RMSE over the last 10 epochs (10-RMSE).

In [None]:
plt.figure(figsize=(12, 5))
ax1 = plt.subplot(121)
ax2 = plt.subplot(122)
for i in range(len(err)):
    ax1.plot(err[i],marker = '.',ms=0,ls='-',lw=4,label='lambda '+str(lmbdas[i]))
    ax2.plot(err_val[i],marker = '.',ms=0,ls='-',lw=4,label='lambda '+str(lmbdas[i]))
    print('>>> Lambda=',lmbdas[i],' 10-RMSE=',np.mean(err[i][-10:]),np.mean(err_val[i][-10:]),'eV')

ax1.set_ylim(1.5,4)
ax1.set_xlim(20,41)
ax2.set_ylim(4,6)
ax2.set_xlim(20,41)
ax1.set_xlabel('epoch',fontsize=22)
ax2.set_xlabel('epoch',fontsize=22)
ax1.set_ylabel('RMSE (eV)',fontsize=22)
ax1.set_title('Training set',fontsize=22)
ax2.set_title('Validation set',fontsize=22)
plt.legend(ncol=2)
plt.tight_layout()
#plt.savefig('figure.pdf')
plt.show()

Based on the previous plots, we can select as optimal the regularization parameter that leads to the lowest validation error (10-RMSE).

__Q.5.__ Assign to the variables `lmbda` and `alpha` the values that lead to the smallest (10-epoch averaged) validation error. Just report the values you have selected previously (2 marks).

In [6]:
### BEGIN SOLUTION
### END SOLUTION

We can now perform gradient descent over 100 epochs with the optimal values.

In [7]:
epochs, mini_batch_size = 100, 10
net = Network([X_train.shape[1],100,100,1])
hyper_params = epochs, mini_batch_size, alpha,lmbda
rmse_list,rmse_list_test,y_predict,y_predict_test = net.SGD(X_train,y_train,X_val,y_val,hyper_params) 

Epoch   0 complete Train 64.6352 eV Test 63.0658 eV
Epoch   1 complete Train 24.4468 eV Test 23.2667 eV @0.367s
Epoch   2 complete Train 13.5417 eV Test 13.3540 eV @0.346s
Epoch   3 complete Train 12.1148 eV Test 12.5226 eV @0.347s
Epoch   4 complete Train 11.8128 eV Test 12.4237 eV @0.344s
Epoch   5 complete Train 10.6796 eV Test 11.2918 eV @0.386s
Epoch   6 complete Train 9.1471 eV Test 9.5945 eV @0.326s
Epoch   7 complete Train 8.0417 eV Test 8.6487 eV @0.322s
Epoch   8 complete Train 7.3207 eV Test 8.2213 eV @0.321s
Epoch   9 complete Train 6.7683 eV Test 7.6870 eV @0.322s
Epoch  10 complete Train 6.3633 eV Test 7.3581 eV @0.323s
Epoch  11 complete Train 5.8644 eV Test 6.9188 eV @0.325s
Epoch  12 complete Train 5.4530 eV Test 6.6285 eV @0.402s
Epoch  13 complete Train 5.1463 eV Test 6.4615 eV @0.320s
Epoch  14 complete Train 4.8910 eV Test 6.3074 eV @0.326s
Epoch  15 complete Train 4.5741 eV Test 6.0498 eV @0.324s
Epoch  16 complete Train 4.5079 eV Test 6.0484 eV @0.319s
Epoch  17 

Finally, we can assess the accuracy of the model by plotting the predicted energies against the actual values for the training and validation sets.

In [None]:
plt.figure(figsize=(12, 5))
ax1 = plt.subplot(121)
ax2 = plt.subplot(122)

ax1.plot(y_train[:xsize],y_predict.T,'.',lw=0)
ax2.plot(y_val[:xsize],y_predict_test.T,'.',lw=0,c='g')
ax1.plot(y_train[:xsize],y_train[:xsize],lw=1,label='y=x')
ax2.plot(y_val[:xsize],y_val[:xsize],lw=1,label='y=x')

ax1.set_xlabel('energies (eV)',fontsize=22)
ax2.set_xlabel('energies (eV)',fontsize=22)
ax1.set_ylabel('predicted (eV)',fontsize=22)
ax1.set_title('Training set',fontsize=22)
ax2.set_title('Validation set',fontsize=22)
plt.legend(ncol=2)
plt.tight_layout()
#plt.savefig('figure.pdf')
plt.show()