# Predicting molecular properties with PiNN

[PiNN](https://github.com/Teoroo-CMC/PiNN) is a Python library we have developed for building atomic neural networks. It implements both a GCNN-based network (called PiNet) and the representation-based BPNN.

In this notebook we'll demonstrate how to predict properties of molecules with PiNN.

Here we use the QM9 dataset [doi:10.6084/m9.figshare.978904](https://figshare.com/collections/Quantum_chemistry_structures_and_properties_of_134_kilo_molecules/978904),
a collection of quantum chemical calculations for about 134,000 organic molecules, which form a subset of the GDB-17 chemical universe of 166 billion organic molecules.


### A table of the properties

   |Property|Unit       |Description|
   |:-------|:----------|:-------------|
   |A       |GHz        |Rotational constant A|
   |B       |GHz        |Rotational constant B|
   |C       |GHz        |Rotational constant C|
   |mu      |Debye      |Dipole moment|
   |alpha   |Bohr^3     |Isotropic polarizability|
   |homo    |Hartree    |Energy of the highest occupied molecular orbital (HOMO)|
   |lumo    |Hartree    |Energy of the lowest unoccupied molecular orbital (LUMO)|
   |gap     |Hartree    |Gap, difference between LUMO and HOMO|
   |r2      |Bohr^2     |Electronic spatial extent|
   |zpve    |Hartree    |Zero point vibrational energy|
   |U0      |Hartree    |Internal energy at 0 K|
   |U       |Hartree    |Internal energy at 298.15 K|
   |H       |Hartree    |Enthalpy at 298.15 K|
   |G       |Hartree    |Free energy at 298.15 K|
   |Cv      |cal/(mol K)|Heat capacity at 298.15 K|


## Installing requirements

To run this module you need to install PiNN and PiNNboard, the libraries we developed for building atomic neural networks and visualizing them. To do so, run the next block.

The block also downloads a copy of the QM9 dataset to your runtime.
*We re-centered all the properties to zero to avoid numerical difficulties during training. You are provided with 20000 data points for training and 20000 more for validation/testing.*

In [None]:
# hide some noisy warnings before we start
import warnings, logging
logging.getLogger("tensorflow").setLevel(logging.ERROR)
warnings.filterwarnings('ignore', 'Converting sparse IndexedSlices')
!pip install tensorboard==2.4
!pip install --quiet git+https://github.com/yqshao/PiNN.git@TF2
!pip install --quiet git+https://github.com/yqshao/PiNNboard.git@TF2
!wget -nv https://raw.githubusercontent.com/yqshao/PiNNLab/master/resources/qm9_{train,test}.{yml,tfr}
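
The re-centering mentioned above amounts to subtracting a reference value from every label. The sketch below assumes the mean is used as that reference and works on a few made-up numbers. The offsets actually used to prepare the downloaded files are not shown in this notebook, so the helper name `recenter` and all values are illustrative only.

```python
from statistics import mean

def recenter(values):
    """Shift values so that their mean is zero; return (shifted, offset)."""
    offset = mean(values)
    return [v - offset for v in values], offset

# Made-up U0-like energies in Hartree, for illustration only.
u0 = [-40.5, -56.5, -76.4, -77.3]
centered, offset = recenter(u0)

# A prediction on the centered scale is mapped back by adding the offset.
restored = [c + offset for c in centered]
print(offset, centered)
```

Keeping the offset around matters if you later want to report predictions on the original scale rather than the centered one.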

## Navigating the dataset

In [None]:
from pinn.io import load_tfrecord, sparse_batch

train_set = load_tfrecord('qm9_train.yml')
vali_set = load_tfrecord('qm9_test.yml')

In TensorFlow, a dataset can be seen as an "iterator"
over data points, so you can write:

``` Python
for data in dataset:
    ...
```

To look at a single data point:

In [None]:
for data in train_set:
    print(data)
    break

---
You can see that each data point is a dictionary holding all the available properties as tensors.

To list just the names of the entries, run this:

In [None]:
data.keys()
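
If you prefer not to write a loop with `break`, Python's `next(iter(...))` fetches the first element of any iterable. The sketch below uses a plain list of dictionaries as a stand-in for the dataset, with made-up keys and values. In eager mode the same expression should also work on a `tf.data.Dataset`, which additionally offers `take(1)`.

```python
# Stand-in for a dataset: an iterable of dictionaries (made-up values).
dataset = [
    {"elems": [6, 1, 1, 1, 1], "U0": -0.12},
    {"elems": [8, 1, 1], "U0": 0.05},
]

first = next(iter(dataset))   # first element, no loop needed
print(sorted(first.keys()))   # names of the available entries
```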

## Preprocessing for training

Before training, we need to label the data so that TensorFlow knows which property to train on. The following code block labels `U0` as the training target
and batches the dataset. Make sure you understand what it does; later you can reuse the code to train on other properties.

In [None]:
def label_data(data):
    x = data
    y = data['U0']
    return x, y

train = train_set.apply(sparse_batch(30)).map(label_data)
vali = vali_set.apply(sparse_batch(30)).map(label_data)
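
To reuse the labelling step for other properties, one option is a small factory that returns a labelling function for a given key. This is only a sketch: `make_labeller` is a hypothetical helper, not part of PiNN, and it assumes the chosen key is one of those listed by `data.keys()`.

```python
def make_labeller(key):
    """Return a function that maps a data dict to (features, label)."""
    def label_data(data):
        return data, data[key]
    return label_data

# Quick check on a plain dictionary with made-up values.
label_gap = make_labeller("gap")
x, y = label_gap({"gap": 0.25, "U0": -0.1})
print(y)
```

In the notebook, the returned function would take the place of `label_data`, for example `train_set.apply(sparse_batch(30)).map(make_labeller('gap'))`.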

## Defining an atomic neural network

### PiNet

PiNet is the GCNN architecture implemented in PiNN. Its structure is controlled by
four neural networks, each serving a different purpose:

- PI layers transform the atomic properties of a pair of atoms into a pairwise interaction
- II layers transform pairwise interactions into pairwise interactions
- PP layers transform atom-wise properties into atom-wise properties
- Output layers perform the actual prediction

![image.png](https://github.com/yqshao/PiNNLab/raw/master/resources/pinet.png)

Each of the four networks is a feed-forward neural network, specified by a
list of hidden units.
- The `depth` parameter controls how many times this "graph convolution operation" is repeated

### The `out_pool` parameter

Recall that both BPNN and PiNet make predictions for individual atoms, so we still need a way to turn atomic predictions into molecular ones. Four options are available for both networks:

- `'sum'`: sum over atomic predictions
- `'min'`: minimum of atomic predictions
- `'max'`: maximum of atomic predictions
- `'avg'`: average of atomic predictions

Which option to choose depends on what you want to predict. For example, when predicting total energies it makes sense to sum up the atomic predictions, but for the HOMO energy `'max'` might be a better choice.
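
The four pooling options can be illustrated in plain Python. This is a conceptual sketch, not PiNN's implementation: the function `pool` and the index list that maps each atom to its molecule are assumptions made for the example.

```python
from collections import defaultdict

def pool(atomic_preds, mol_index, mode="sum"):
    """Combine per-atom predictions into one value per molecule."""
    groups = defaultdict(list)
    for value, mol in zip(atomic_preds, mol_index):
        groups[mol].append(value)
    reducers = {
        "sum": sum,
        "min": min,
        "max": max,
        "avg": lambda vals: sum(vals) / len(vals),
    }
    reduce_fn = reducers[mode]
    return [reduce_fn(groups[mol]) for mol in sorted(groups)]

# Two molecules: atoms 0-2 belong to molecule 0, atoms 3-4 to molecule 1.
atomic = [1.0, 2.0, 3.0, 4.0, 6.0]
index = [0, 0, 0, 1, 1]
for mode in ("sum", "min", "max", "avg"):
    print(mode, pool(atomic, index, mode))
```

Note that `'sum'` grows with the number of atoms while the other three do not, which is one reason it suits extensive properties such as total energies.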

### Fitting an atomic neural network

With PiNN you can create atomic neural networks as Keras models. Define and fit a PiNet with the following code block.

We use a relatively small model here and train for only a few epochs
for the purpose of demonstration; feel free to increase both.

In [None]:
from pinn.networks.pinet import PiNet

pinet = PiNet(pp_nodes=[6], ii_nodes=[6,6], pi_nodes=[6,6], out_nodes=[3], 
              depth=2, out_pool='sum')
pinet.compile(optimizer='Adam', loss='MAE')
pinet.fit(train, epochs=3)

### Visualizing a network 

PiNN provides a tool called PiNNboard to visualize the activations and weights of an ANN.
To use PiNNboard, you create a callback, just as you would for TensorBoard.

Here we visualize the neural network with the first 30 samples in the training set.

In [None]:
from tensorboard_plugin_pinnboard.summary import PiNNBoardCallback
from tensorflow.keras.callbacks import TensorBoard

logdir = 'logs/PiNet'
tb_cbk = TensorBoard(log_dir=logdir, write_graph=True)
pb_cbk = PiNNBoardCallback(logdir, train_set.apply(sparse_batch(30)))

pinet = PiNet(pp_nodes=[6], ii_nodes=[6,6], pi_nodes=[6,6], out_nodes=[3], 
              depth=2, out_pool='sum')

pinet.compile(optimizer='Adam', loss='MAE')
pinet.fit(train, epochs=3, validation_data=vali, callbacks=[tb_cbk, pb_cbk])

Now run the next block. TensorBoard will start and you can select your network
to visualize it.

- Try to understand what the connections mean.
- Move the slider to see how the weights and activations change during training.

In [None]:
%load_ext tensorboard
%tensorboard --logdir logs --samples_per_plugin PiNNboard=50

**TASK**

Pick one of the properties in the QM9 dataset to predict:

- Train a PiNet for the selected property
- Visualize the trained model using TensorBoard
- Change the `out_pool` parameter to see its impact
- Try to improve the performance by tweaking the network parameters

Answer the questions:

- Do the activations of the nodes make sense?
  - Do they distinguish atoms meaningfully?
  - Can you recognize any interactions they have learnt?
- How does `out_pool` affect your prediction?
- What is your best training performance, and how did you get it?
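
One way to organise these experiments is to enumerate the settings you want to compare before training anything. The sketch below only builds the keyword dictionaries. The parameter names are the ones used for `PiNet` earlier in this notebook, while the particular values in the grid are arbitrary choices for illustration.

```python
from itertools import product

# Settings kept fixed across runs (same as the example model above).
base = dict(pp_nodes=[6], ii_nodes=[6, 6], pi_nodes=[6, 6], out_nodes=[3])

# Every combination of depth and pooling mode to be compared.
configs = [
    dict(base, depth=depth, out_pool=out_pool)
    for depth, out_pool in product([2, 3], ["sum", "max", "avg"])
]

for cfg in configs:
    # In the notebook: PiNet(**cfg), then compile and fit as before,
    # ideally with a separate log directory per configuration.
    print(cfg["depth"], cfg["out_pool"])
```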


For reference, the figure below shows the mean absolute error of the U0 predictions of different machine learning models
on the QM9 dataset as a function of the training set size (we are using $2\times 10^4$ samples from the dataset here). Can you
find your position in it?

![image.png](https://github.com/yqshao/PiNNLab/raw/master/resources/qm9_benchmarks.png)
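
To compare your result with a published benchmark you may need to convert units. The sketch below computes a mean absolute error by hand and converts it, assuming your labels are in Hartree as listed in the property table. Check the axis label of the figure for the unit it uses. The numbers are made up.

```python
HARTREE_TO_KCAL_PER_MOL = 627.509
HARTREE_TO_MEV = 27211.4

def mean_absolute_error(y_true, y_pred):
    """Average of |true - predicted| over all samples."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up reference values and predictions in Hartree.
y_true = [0.010, -0.020, 0.005]
y_pred = [0.012, -0.017, 0.001]

mae = mean_absolute_error(y_true, y_pred)
print(mae * HARTREE_TO_KCAL_PER_MOL, mae * HARTREE_TO_MEV)
```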

## BPNN

Another network implemented in PiNN is the Behler-Parrinello neural network (BPNN).
The network is specified by a set of symmetry functions and element-specific
neural networks. Below is an example setup. You can read more about the definition
of symmetry functions in this [paper](https://onlinelibrary.wiley.com/doi/full/10.1002/qua.24890) and in the [PiNN documentation](https://teoroo-pinn.readthedocs.io/en/latest/networks/bpnn.html).

- Run the following block, which trains a BPNN with a minimal set of
   symmetry functions.
- Visualize the network in PiNNboard and look at the structure of the BPNN.

**BONUS:**
Can you find a good BPNN setup that predicts your molecular property well?
- In the example we do not distinguish between elements when defining the symmetry functions;
  you can improve on this by tailoring the symmetry functions to different elements or pairs of elements.

In [None]:
from pinn.networks import BPNN

bpnn = BPNN(
    sf_spec=[{'type':'G2', 
              'i': 'ALL', 'j': 'ALL', 
              'Rs': [1.,1.5,2.], 'eta': [0.2, 0.5, 1.0]}],
    nn_spec={6: [3, 3, 3], 1: [3, 3, 3]} ,
    out_pool='sum')

logdir = 'logs/BPNN'
tb_cbk = TensorBoard(log_dir=logdir, write_graph=True)
pb_cbk = PiNNBoardCallback(logdir, train_set.apply(sparse_batch(30)))

bpnn.compile(optimizer='Adam', loss='MAE')
bpnn.fit(train, epochs=3, validation_data=vali, callbacks=[tb_cbk, pb_cbk])
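
To get a feeling for what a `G2` symmetry function encodes, it can be evaluated by hand. In the form given in the paper linked above, the radial function of atom $i$ is $G^2_i = \sum_j e^{-\eta (r_{ij}-R_s)^2} f_c(r_{ij})$ with the cosine cutoff $f_c(r) = \frac{1}{2}\left[\cos(\pi r / r_c) + 1\right]$ for $r < r_c$ and zero otherwise. The sketch below is plain Python, not PiNN's implementation. The cutoff radius, the distances, and the element-wise pairing of `Rs` with `eta` are assumptions made for illustration, so consult the PiNN documentation for how the library combines them.

```python
import math

def cutoff(r, rc):
    """Cosine cutoff: decays smoothly from 1 at r = 0 to 0 at r = rc."""
    if r >= rc:
        return 0.0
    return 0.5 * (math.cos(math.pi * r / rc) + 1.0)

def g2(distances, rs, eta, rc):
    """Radial symmetry function of one atom from its neighbour distances."""
    return sum(math.exp(-eta * (r - rs) ** 2) * cutoff(r, rc)
               for r in distances)

# Made-up neighbour distances around one atom.
distances = [1.1, 1.1, 1.5, 2.4]
rc = 4.0  # assumed cutoff radius, not taken from the notebook

# Assumption: Rs and eta are paired element-wise, giving three features.
fingerprint = [g2(distances, rs, eta, rc)
               for rs, eta in zip([1.0, 1.5, 2.0], [0.2, 0.5, 1.0])]
print(fingerprint)
```

Each entry of `fingerprint` responds most strongly to neighbours at a distance near its `Rs`, so together the entries give a coarse radial profile of the atom's environment.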