<a href="https://colab.research.google.com/github/schwallergroup/ai4chem_course/blob/scikit_learn/notebooks/02%20-%20Supervised%20Learning/training_and_evaluating_ml_models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Week 3 tutorial - AI 4 Chemistry

## Table of contents

1. Supervised deep learning.
2. Inductive biases.
3. Training neural networks.
4. Model selection and optimization.

In [1]:
# Install all libraries
!pip install numpy scipy pandas scikit-learn torch pytorch-lightning deepchem rdkit

# Download all data
!mkdir -p data/
!wget https://raw.githubusercontent.com/schwallergroup/ai4chem_course/scikit_learn/notebooks/02%20-%20Supervised%20Learning/data/esol.csv -O data/esol.csv
!wget https://raw.githubusercontent.com/schwallergroup/ai4chem_course/scikit_learn/notebooks/02%20-%20Supervised%20Learning/data/toxcast_data.csv -O data/toxcast_data.csv



# 1. Supervised Deep Learning

From last session, we should already be familiar with supervised learning: a type of machine learning in which a model is trained on a labeled dataset to learn the relationship between input and output data.

The models we have seen so far are fairly simple and work well in some scenarios, but sometimes they are not enough. What can we do in these cases?


<div align="center">
<img src="img/deeper_meme.png" width="500"/>
</div>

### Deep Learning
Deep learning is a subset of machine learning that involves training artificial neural networks to learn from data. Unlike traditional machine learning algorithms, which often rely on hand-crafted features and linear models, deep learning algorithms can automatically learn features and hierarchies of representations from raw data. This allows deep learning models to achieve state-of-the-art performance on a wide range of tasks in chemistry, like molecular property prediction, reaction prediction and retrosynthesis, among others.

#### Data: Let's go back to the [ESOL dataset](https://pubs.acs.org/doi/10.1021/ci034243x) from last week.
We use it so that we can compare our results with the previous models. We'll reuse last week's code for data loading and preprocessing.

In [2]:
import pandas as pd
from torch.utils.data import DataLoader

# load dataset from the CSV file
esol_df = pd.read_csv('data/esol.csv')

# Get NumPy arrays from DataFrame for the input and target
smiles = esol_df['smiles'].values
y = esol_df['log solubility (mol/L)'].values

# Here, we use molecular descriptors from RDKit, like molecular weight, number of valence electrons, maximum and minimum partial charge, etc.
from deepchem.feat import RDKitDescriptors
featurizer = RDKitDescriptors()
features = featurizer.featurize(smiles)
print(f"Number of generated molecular descriptors: {features.shape[1]}")

# Drop the features containing invalid values
import numpy as np
features = features[:, ~np.isnan(features).any(axis=0)]
print(f"Number of molecular descriptors without invalid values: {features.shape[1]}")

Skipped loading some Tensorflow models, missing a dependency. No module named 'tensorflow'
Skipped loading modules with pytorch-geometric dependency, missing a dependency. No module named 'torch_geometric'
Skipped loading modules with pytorch-geometric dependency, missing a dependency. cannot import name 'DMPNN' from 'deepchem.models.torch_models' (/home/andres/anaconda3/envs/ai4chem/lib/python3.8/site-packages/deepchem/models/torch_models/__init__.py)
Skipped loading some Jax models, missing a dependency. No module named 'jax'


Number of generated molecular descriptors: 208
Number of molecular descriptors without invalid values: 208


In [3]:
# Data preprocessing
from sklearn.model_selection import train_test_split
X = features
# training data size : test data size = 0.8 : 0.2
# fixed seed using the random_state parameter, so it always has the same split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=0)


from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(X_train)

# save original X
X_train_ori = X_train
X_test_ori = X_test
# transform data
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

## 1.1 Neural Networks

Neural networks are a type of machine learning model loosely inspired by the way neurons are connected in the human brain.\
They consist of layers of interconnected nodes. Each node applies a `linear function` to its inputs, followed by a non-linear activation function. The activation functions introduce `non-linearity` into the model, allowing it to learn more complex patterns in the data.
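
To make the "linear function plus activation" idea concrete, here is a minimal sketch in plain Python. The helper names (`relu`, `neuron`, `layer`) and the weight values are made up for illustration; PyTorch's `nn.Linear` and `nn.ReLU` do the same kind of computation far more efficiently.

```python
def relu(z):
    """ReLU activation: keep positive values, clamp negatives to zero."""
    return max(0.0, z)

def neuron(inputs, weights, bias):
    """One node: a weighted sum (linear part) followed by ReLU (non-linear part)."""
    linear = sum(w * x for w, x in zip(weights, inputs)) + bias
    return relu(linear)

def layer(inputs, weight_rows, biases):
    """A layer is a list of nodes that all see the same inputs."""
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]

# Illustrative input and weights (not learned from data)
x = [1.0, 2.0]
hidden = layer(x, weight_rows=[[0.5, -1.0], [1.0, 1.0]], biases=[0.0, -1.0])
print(hidden)  # [0.0, 2.0]
```

The first node's weighted sum is negative, so ReLU sets it to zero; the second node passes its positive value through unchanged.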

In [4]:
import os
import torch
from torch import nn
import torch.nn.functional as F
import pytorch_lightning as pl

## 1.2 Creating a deep learning model

#### 1.2.1. Classes: We can define a class to represent abstract objects that have properties and that do things.

We can define a `Dog` class like this:

```python
class Dog:
    def __init__(self):
        self.color = "brown"
        self.weight = 43
        
    def bark(self):
        print("Woof!")
```

In this example, a dog has two properties: `color` and `weight`. You can define more if you want a more accurate representation of a dog :)\
Our dog can also `bark`, and this is a __thing it can do__, or a `method`. Again, you can define more methods for your class.
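
As a quick illustration of how such a class is used, the sketch below repeats the `Dog` definition so that it runs on its own, then creates an instance and uses it. The variable name `my_dog` is just an example.

```python
class Dog:
    def __init__(self):
        self.color = "brown"
        self.weight = 43

    def bark(self):
        print("Woof!")

my_dog = Dog()        # create an instance; this runs __init__
print(my_dog.color)   # read a property
my_dog.bark()         # call a method
```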


<font color="#4caf50" size=4>
Now let's define a NeuralNetwork class.
</font>

- What is each part?
    - `__init__` is where we specify the model architecture.

        There are many layers (model parts) you can use,
        and they are all defined here.

    - `training_step` is one of our model's methods. It computes the loss for one batch of data; Lightning then uses this loss and the optimizer to update the model parameters.

    - `configure_optimizers`, well, configures the optimizers 😅. Here we define which optimizer to use, including its learning rate.

    - `forward` specifies what the model should do when an input is given.

In [30]:
class NeuralNetwork(pl.LightningModule):
    def __init__(self, input_sz, hidden_sz, output_sz, lr=1e-3):
        super().__init__()
        self.lr = lr
        
        # Define all the components
        self.layers = nn.Sequential(
            nn.Linear(input_sz, hidden_sz),
            nn.ReLU(),
            #nn.Linear(hidden_sz, hidden_sz),
            #nn.ReLU(),
            nn.Linear(hidden_sz, output_sz)
        )
        
    def training_step(self, batch, batch_idx):
        # Here we define the train loop.
        x, y = batch
        z = self.layers(x)
        loss = F.mse_loss(z, y)
        return loss

    def configure_optimizers(self):
        # Here we configure the optimization algorithm.
        optimizer = torch.optim.Adam(
            self.parameters(),
            lr=self.lr
        )
        return optimizer
    
    def forward(self, x):
        # Here we define what the NN does with its parts
        return self.layers(x).flatten()

### Dataset class

To use Lightning, we also need to create a `Dataset` class.\
It looks more complicated, but it actually allows a lot of flexibility in more complex scenarios! (so don't be daunted by this 😉)

In [18]:
from torch.utils.data import Dataset

class ESOLDataset(Dataset):
    def __init__(self, X, y):
        self.X = X
        self.y = y

    def __len__(self):
        return self.X.shape[0]

    def __getitem__(self, idx):
        
        if torch.is_tensor(idx):
            idx = idx.tolist()
        X_ = torch.as_tensor(self.X[idx].astype(np.float32))
        y_ = torch.as_tensor(self.y[idx].astype(np.float32).reshape(-1))
        
        return X_, y_
    
train_data = ESOLDataset(X_train, y_train)
test_data = ESOLDataset(X_test, y_test)

In [45]:
train_loader = DataLoader(train_data, batch_size=254)
nn_model = NeuralNetwork(208, 254, 1, lr=1e-2)

# Define trainer: How we want to train the model
trainer = pl.Trainer(
    max_epochs=100
)

# Finally! Training a model :)
trainer.fit(
    model=nn_model,
    train_dataloaders=train_loader,
)

GPU available: True (cuda), used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs

  | Name   | Type       | Params
--------------------------------------
0 | layers | Sequential | 53.3 K
--------------------------------------
53.3 K    Trainable params
0         Non-trainable params
53.3 K    Total params
0.213     Total estimated model params size (MB)


Training: 0it [00:00, ?it/s]

`Trainer.fit` stopped: `max_epochs=100` reached.


In [46]:
from sklearn.metrics import mean_squared_error

def mse_nn(model, data):
    """Compute the MSE of `model` on every sample in `data`."""
    # Fetch the whole dataset at once as tensors
    X_tens, y_tens = data[range(len(data))]
    # Use the model passed as an argument (not the global nn_model)
    y_pred = model(X_tens).detach()
    y_pred = y_pred.numpy()
    y_tens = y_tens.numpy()

    return mean_squared_error(y_tens, y_pred)

def test_model(model, train_data, test_data):
    """
    Function that tests a model.
    Inputs: model, train_data, test_data
    """
    # Calculate RMSE as the square root of the MSE
    rmse_train = mse_nn(model, train_data) ** 0.5
    rmse_test = mse_nn(model, test_data) ** 0.5
    print(f"RMSE on train set: {rmse_train:.3f}, and test set: {rmse_test:.3f}.\n")

In [47]:
test_model(nn_model, train_data, test_data)

RMSE on train set: 0.416, and test set: 0.692.



# Exercise:

Play with the hyperparameters and see what you get.

You may play around with `hidden_sz`, `batch_size`, `max_epochs`, `lr`,\
or even modify the architecture of our neural network, i.e. change the number of layers, the activation function, etc.
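
One possible way to organize such experiments is to list every combination of hyperparameters up front. The sketch below only builds that list. The search space values are arbitrary examples, not recommendations, and the training call is left as a comment that assumes the `NeuralNetwork`, `train_data` and `test_model` objects defined earlier in this notebook.

```python
from itertools import product

# Hypothetical search space; values are illustrative only
search_space = {
    "hidden_sz": [64, 254],
    "batch_size": [64, 254],
    "max_epochs": [50, 100],
    "lr": [1e-3, 1e-2],
}

# Build one dict per combination of hyperparameter values
configs = [
    dict(zip(search_space, values))
    for values in product(*search_space.values())
]

print(f"{len(configs)} configurations to try")
for cfg in configs[:3]:
    # For each cfg one could, for example:
    #   loader = DataLoader(train_data, batch_size=cfg["batch_size"])
    #   model = NeuralNetwork(208, cfg["hidden_sz"], 1, lr=cfg["lr"])
    #   pl.Trainer(max_epochs=cfg["max_epochs"]).fit(model, loader)
    #   test_model(model, train_data, test_data)
    print(cfg)
```

A full grid grows quickly with each added hyperparameter, so in practice it can be worth changing one value at a time first.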

### Inductive biases

Inductive biases are assumptions that are built into the design of a machine learning model. They can help the model learn more quickly and accurately when the assumptions match the data, but they also make the model less flexible and less adaptable to situations where the assumptions do not hold. Choosing an inductive bias is therefore a trade-off between accuracy and flexibility.


### How to train

Training a neural network involves selecting an appropriate architecture, initializing the weights of the model, and then iteratively adjusting the weights of the model to minimize the error between the predicted output and the actual output. This is typically done using an optimization algorithm such as gradient descent.

### Gradient descent

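Gradient descent is an optimization algorithm that is used to train neural networks. At each step, it moves the weights of the model a small amount in the direction opposite to the gradient of the loss function, with the step size set by the learning rate. Repeating this step iteratively reduces the error between the predicted output and the actual output. The goal is to find the weights that minimize the loss function.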
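
As a minimal sketch of this idea, the example below fits a straight line to four toy points using plain-Python gradient descent. The data (generated from y = 2x + 1), the learning rate and the number of steps are assumptions chosen for illustration; a real network has many more weights, and PyTorch computes the gradients automatically.

```python
# Toy data generated from y = 2x + 1 (illustrative assumption)
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w, b = 0.0, 0.0   # initial weights
lr = 0.05         # learning rate (step size)

def mse(w, b):
    """Mean squared error of the line y = w*x + b on the toy data."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

initial_loss = mse(w, b)

for step in range(2000):
    # Gradients of the MSE with respect to w and b
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / len(xs)
    # Step against the gradient
    w -= lr * grad_w
    b -= lr * grad_b

final_loss = mse(w, b)
print(f"w = {w:.3f}, b = {b:.3f}, loss = {final_loss:.6f}")
```

With these settings, `w` and `b` should approach 2 and 1. A learning rate that is too large can make the updates diverge instead.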

### Backpropagation

Backpropagation is an efficient method for computing the gradients of the loss function with respect to the weights of the model. It applies the chain rule layer by layer, from the output back to the input. The optimizer then uses these gradients to update the weights during training, which is what makes gradient-based optimization algorithms such as gradient descent practical.
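
The chain rule can be sketched by hand on a very small network with one input, one hidden ReLU unit and one output. The function name `forward_backward` and the numbers are made up for illustration, and the result is checked against a finite-difference estimate. In practice, PyTorch's autograd performs this bookkeeping for us.

```python
def forward_backward(x, y, w1, w2):
    """Return the squared-error loss and its gradients w.r.t. w1 and w2."""
    # Forward pass: keep the intermediate values for the backward pass
    z = w1 * x               # hidden pre-activation
    h = max(0.0, z)          # ReLU activation
    y_hat = w2 * h           # prediction
    loss = (y_hat - y) ** 2

    # Backward pass: chain rule, from the loss back to each weight
    dloss_dyhat = 2 * (y_hat - y)
    dloss_dw2 = dloss_dyhat * h
    dloss_dh = dloss_dyhat * w2
    dh_dz = 1.0 if z > 0 else 0.0     # derivative of ReLU
    dloss_dw1 = dloss_dh * dh_dz * x
    return loss, dloss_dw1, dloss_dw2

x, y, w1, w2 = 1.5, 2.0, 0.8, -0.5
loss, grad_w1, grad_w2 = forward_backward(x, y, w1, w2)

# Numerical check with central finite differences
eps = 1e-6
num_w1 = (forward_backward(x, y, w1 + eps, w2)[0]
          - forward_backward(x, y, w1 - eps, w2)[0]) / (2 * eps)
num_w2 = (forward_backward(x, y, w1, w2 + eps)[0]
          - forward_backward(x, y, w1, w2 - eps)[0]) / (2 * eps)

print(f"backprop:  dL/dw1 = {grad_w1:.4f}, dL/dw2 = {grad_w2:.4f}")
print(f"numerical: dL/dw1 = {num_w1:.4f}, dL/dw2 = {num_w2:.4f}")
```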

### Loss functions

Loss functions are used to measure the difference between the predicted output of the model and the actual output. They are used to guide the optimization algorithm during training. Commonly used loss functions include mean squared error, cross-entropy loss, and binary cross-entropy loss.
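
Two of these losses can be written in a few lines of plain Python. The helper names `mse_loss` and `bce_loss` are hypothetical, and the sketch is meant only to show what is being computed; in the notebook itself we rely on PyTorch's `F.mse_loss`.

```python
import math

def mse_loss(y_true, y_pred):
    """Mean squared error, typically used for regression."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def bce_loss(y_true, p_pred, eps=1e-12):
    """Binary cross-entropy for labels in {0, 1} and predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)   # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

# Illustrative values only
mse_value = mse_loss([1.0, 2.0, 3.0], [1.0, 2.5, 2.0])
bce_value = bce_loss([1, 0], [0.9, 0.2])
print(f"MSE: {mse_value:.4f}")
print(f"BCE: {bce_value:.4f}")
```

Both losses are zero (up to clipping) when the predictions match the targets exactly, and grow as the predictions move away from them.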

### Is DL always better?

Deep learning models are not always the best choice for every problem. One of the challenges in deep learning is that the models can be highly sensitive to small changes in the input, which can result in poor performance on certain types of data. One example of this is the concept of activity cliffs in the chemical space. Activity cliffs are regions where small changes in the structure of a molecule result in large changes in its activity. Deep learning models may not always be the best choice for predicting these activity cliffs.

### LSTM from SMILES
