# Exercise 04 ANN for Regression - Instruction

## Pedagogy

This notebook walks through implementing an artificial neural network (ANN) in PyTorch and using it as a linear regression model.

Please use this notebook as a reference and guide to complete the assignment.

### Import libraries

In [None]:
# import libraries
import torch
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn import datasets
from sklearn.model_selection import train_test_split

from torch import nn
from torch.utils.data import TensorDataset, DataLoader

## Part 1. Implement an ANN as a linear regression model

### Step 1. Build the data pipeline

Training an artificial neural network is an iterative process.

We need to feed a batch of training samples to the network at each iteration, make predictions, compute loss and gradients, and update learning parameters accordingly.

In practice, the whole training set is divided into several batches. The number of batches depends on the batch size and the size of training set. Once all batches are fed to the network, we say an epoch is completed.
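As a quick check of how the training-set size and the batch size determine the number of batches per epoch, here is a small sketch. The sample count and batch size are made-up numbers, and we assume the last, smaller batch is kept rather than dropped.

```python
import math

n_samples = 100   # hypothetical training-set size
batch_size = 16   # hypothetical batch size

# the last batch holds the leftover samples, so we round up
batches_per_epoch = math.ceil(n_samples / batch_size)
last_batch_size = n_samples - (batches_per_epoch - 1) * batch_size

print(batches_per_epoch)  # 7
print(last_batch_size)    # 4
```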

Training a network for one epoch is usually not enough. Therefore, we need to go through the training set repeatedly for many epochs to get a well-fitted model.

That's why we need to build a data pipeline that can continuously and iteratively feed batches of training samples to the network.

`PyTorch` has two high-level APIs to work with data:
- `torch.utils.data.Dataset`
- `torch.utils.data.DataLoader`

`torch.utils.data.Dataset` stores the samples and their corresponding labels.
`torch.utils.data.DataLoader` wraps an iterable around the `torch.utils.data.Dataset`, so that we can loop over the samples in batches.
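To make the two roles concrete, here is a minimal sketch with made-up tensors (not the diabetes data). It uses `TensorDataset`, a ready-made `Dataset` subclass, and shows that indexing the dataset returns one sample while iterating over the data loader returns batches.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# made-up data: 10 samples, 3 features and 1 label each
X = torch.arange(30, dtype=torch.float32).reshape(10, 3)
y = torch.arange(10, dtype=torch.float32)

ds = TensorDataset(X, y)            # stores (feature, label) pairs
dl = DataLoader(ds, batch_size=4)   # yields batches of up to 4 samples

first_feature, first_label = ds[0]  # indexing returns a single sample
batch_sizes = [len(yb) for _, yb in dl]

print(len(ds))       # 10
print(batch_sizes)   # [4, 4, 2]
```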

For this notebook, we will use a toy dataset, the diabetes dataset, from `sklearn`. You can find more details about this dataset at this [link](https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html).

The diabetes dataset is a tabular dataset that consists of numerical feature and target variables, so we can use `torch.utils.data.TensorDataset`, a ready-made class that inherits from the `torch.utils.data.Dataset` class and wraps tensors directly.

In this step, we need to:
1. Load the diabetes dataset
2. Divide the dataset into the training and test set
3. Create a `TensorDataset` instance to store the training/test set
4. Create a `DataLoader` instance to wrap the training/test set as an iterable object

In [None]:
# load diabetes dataset
feature, label = datasets.load_diabetes(
    return_X_y = True,
    as_frame = False, # get data as numpy array
    scaled = True # input features should be scaled before training an ANN
)
# train test split
train_feature, test_feature, train_label, test_label = train_test_split(
    feature,
    label,
    train_size = 0.7,
    shuffle = True,
    random_state = 0
)

In [None]:
# create the train and test dataset
# specify the type of data stored in the tensors to avoid incompatibility
train_ds = TensorDataset(
    torch.tensor(train_feature, dtype = torch.float32),
    torch.tensor(train_label, dtype = torch.float32)
)
test_ds = TensorDataset(
    torch.tensor(test_feature, dtype = torch.float32),
    torch.tensor(test_label, dtype = torch.float32)
)

In [None]:
# create the train and test data loaders
batch_size = 16 # usually a power of 2
train_dl = DataLoader(train_ds, batch_size = batch_size, shuffle = True)
test_dl = DataLoader(test_ds, batch_size = batch_size, shuffle = False)
# shuffle = True means the data is reshuffled at every epoch
# reshuffling is recommended for the training data
# the test data is not reshuffled since it is fed to the network only once
# we may also need to keep the order of the test samples

In [None]:
# get a minibatch from the data loader and print shape of feature and label
for (X, y) in train_dl:
    print(X.shape)
    print(y.shape)
    break

We can see the feature batch has two dimensions:
- 1st dimension is the batch size
- 2nd dimension is the number of features

The label batch has only one dimension, which is the batch size. This is because we only have one target variable as the label to predict. If there are multiple target variables to predict using a single network, the label batch will also have two dimensions.
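The difference between the single-target and multi-target case can be sketched with made-up tensors. The three-target label array below is purely hypothetical; the diabetes dataset has only one target.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

X = torch.zeros(32, 10)        # 32 made-up samples, 10 features
y_single = torch.zeros(32)     # one target per sample
y_multi = torch.zeros(32, 3)   # three targets per sample (hypothetical)

_, yb_single = next(iter(DataLoader(TensorDataset(X, y_single), batch_size=16)))
_, yb_multi = next(iter(DataLoader(TensorDataset(X, y_multi), batch_size=16)))

print(yb_single.shape)  # torch.Size([16])
print(yb_multi.shape)   # torch.Size([16, 3])
```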

### Step 2. Create the artificial neural network

To define a neural network in `PyTorch`, we create a class that inherits from `torch.nn.Module`.

We need to define the layers of the network in the `__init__()` method and specify how data passes through the network in the `forward()` method.

In [None]:
# define a custom neural network class
class NeuralNetwork(nn.Module):
    def __init__(self, n_features, n_labels):
        super().__init__()
        self.net = nn.Linear(n_features, n_labels)
    def forward(self, X):
        return self.net(X)

We defined a simple network with only one layer. The input size of this layer is the number of feature variables, and the output size is the number of target variables. We didn't add any activation layer, so the output is passed through unchanged, which is equivalent to a linear activation function.

This is how we use an artificial neural network as a linear regression model.
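We can check this equivalence numerically. The sketch below uses a made-up layer size and random inputs, and compares the output of `nn.Linear` with the linear regression formula computed by hand from the layer's `weight` and `bias`.

```python
import torch
from torch import nn

torch.manual_seed(0)
layer = nn.Linear(3, 1)   # 3 features, 1 target (made-up sizes)
X = torch.randn(5, 3)     # 5 made-up samples

with torch.no_grad():
    out_layer = layer(X)                          # output of the layer
    out_manual = X @ layer.weight.T + layer.bias  # y_hat = Xw + b by hand

print(torch.allclose(out_layer, out_manual, atol=1e-6))
```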

Read the `PyTorch` documentation:
- `torch.nn.Linear()` at this [link](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html).

Creating a class to define the ANN is not enough. We also need to create an instance of the class.

In [None]:
# create the neural network
model = NeuralNetwork(
    n_features = feature.shape[1],
    n_labels = 1
)

In [None]:
# print network structure and learning parameters
print(f"Model structure:\n{model}\n")
for name, param in model.named_parameters():
    print(f"Layer: {name} | Size: {param.size()}")

### Step 3. Training by gradient descent

To train a neural network, we need:
- A data loader of the training set (batch size)
- A loss function
- An optimizer (learning rate)
- The number of epochs to train

We can define a `train()` function that takes the above elements as parameters and performs the training process.

In a single training loop, the neural network makes predictions on the training samples (fed to it in batches), and backpropagates the prediction error to adjust the learning parameters. It's also good to record the changes in loss for further analysis and adjustment.
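The parameter update in one iteration can be verified by hand. The sketch below assumes plain SGD (no momentum or weight decay) and uses a made-up model, `toy_model`, with random data. It checks that `optimizer.step()` moves the weights by the learning rate times the gradient.

```python
import torch
from torch import nn

torch.manual_seed(0)
toy_model = nn.Linear(2, 1)   # hypothetical tiny model
loss_fn = nn.MSELoss()
lr = 0.1
optimizer = torch.optim.SGD(toy_model.parameters(), lr=lr)

X = torch.randn(4, 2)         # one made-up batch of 4 samples
y = torch.randn(4)

loss = loss_fn(toy_model(X).squeeze(-1), y)  # forward pass and loss
loss.backward()                              # gradients by backpropagation

w_before = toy_model.weight.detach().clone()
grad = toy_model.weight.grad.detach().clone()
optimizer.step()                             # update the parameters
optimizer.zero_grad()                        # clear gradients for the next batch

expected = w_before - lr * grad              # the SGD update rule
print(torch.allclose(toy_model.weight.detach(), expected, atol=1e-6))
```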

Let's define a `train()` function to implement this process.

In [None]:
# define the training function
def train(dataloader, model, loss_fn, optimizer, epochs):
    batch_loss_history = [] # for recording the average loss of a batch
    epoch_loss_history = [] # for recording the average loss of an epoch
    model.train() # set the model in training mode
    for epoch in range(epochs): # iterate for the pre-defined number of epochs
        epoch_loss = 0.0 # initial epoch loss is set to zero
        for (X, y) in dataloader: # get a batch of training samples
            pred = model(X).squeeze(-1) # make predictions, squeeze(-1) drops the last dimension so `pred` matches the 1D `y`
            batch_loss = loss_fn(pred, y) # compute the current batch loss
            batch_loss.backward() # compute gradients by backpropagation
            optimizer.step() # update learning parameters according to gradients
            optimizer.zero_grad() # reset the gradients to zero
            batch_loss_history.append(batch_loss.item()) # record current batch loss
            epoch_loss += batch_loss.item() # accumulate batch losses to compute the epoch loss
        epoch_loss /= len(dataloader) # compute current epoch loss
        epoch_loss_history.append(epoch_loss) # record current epoch loss
        print(f"Epoch {epoch + 1}: train loss = {epoch_loss}") # print log
    return batch_loss_history, epoch_loss_history

After defining the `train()` function, we need to specify the loss function, learning rate, optimizer, and number of epochs before training.

In [None]:
# define the training hyper-parameters
loss_fn = nn.MSELoss()
learning_rate = 1e-2
optimizer = torch.optim.SGD(model.parameters(), lr = learning_rate)
epochs = 1000

In [None]:
# train the neural network
batch_loss_history, epoch_loss_history = train(train_dl, model, loss_fn, optimizer, epochs)

In [None]:
# plot the train loss history
plt.figure()
batch = np.arange(len(batch_loss_history))
epoch = np.arange(len(epoch_loss_history))
batches_per_epoch = len(train_dl) # number of batches the data loader yields per epoch
plt.plot(batch, batch_loss_history, '-', label = 'batch loss')
plt.plot(epoch * batches_per_epoch, epoch_loss_history, '-', label = 'epoch loss')
plt.title('train loss history')
plt.xlabel('batch')
plt.ylabel('train loss')
plt.legend()
plt.show()

The loss is gradually reduced by mini-batch gradient descent.

Because each batch is a random subset of the training set, the batch loss fluctuates significantly even though it shows a decreasing trend.

The epoch loss, which averages the batch losses over a whole epoch, does not show such fluctuation. This suggests that mini-batch gradient descent is effective despite the noise in individual batches.

### Step 4. Save and load a trained model

Training a neural network often takes a lot of time. We don't want to retrain it every time we use it.

A better approach is to save the trained model after training is completed and load it again for subsequent use.

A common way to save a model is to serialize the internal state dictionary (containing the model parameters).

In [None]:
# save model
file_name = 'model.pth'
torch.save(model.state_dict(), file_name)
print('Saved PyTorch Model State to '+ file_name)

The process for loading a model includes re-creating the model structure and loading the state dictionary (containing the model parameters) into it.

In [None]:
# load model
model = NeuralNetwork(
    n_features = feature.shape[1],
    n_labels = 1
)
model.load_state_dict(torch.load(file_name))

The output `<All keys matched successfully>` means we successfully loaded the trained parameters into the newly created neural network.

If the structure of the newly created network is different from the structure of the trained model, we will get an error.
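Both cases can be sketched without touching `model.pth`. The example below is an illustration under assumptions: it uses an in-memory buffer instead of a file, and two stand-in `nn.Linear` networks with made-up sizes. Loading into a mismatched network raises a `RuntimeError`.

```python
import io
import torch
from torch import nn

saved = nn.Linear(10, 1)   # stands in for the trained model
buffer = io.BytesIO()      # in-memory file, used here instead of 'model.pth'
torch.save(saved.state_dict(), buffer)

buffer.seek(0)
state = torch.load(buffer)

same = nn.Linear(10, 1)                # same structure as the saved model
result = same.load_state_dict(state)   # parameter shapes match
print(result)

different = nn.Linear(5, 1)            # different input size
try:
    different.load_state_dict(state)
    mismatch_error = None
except RuntimeError as err:            # raised on size mismatch
    mismatch_error = err
print(type(mismatch_error).__name__)   # RuntimeError
```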

### Step 5. Make predictions and evaluation

The loss value on the training set only tells us how well the neural network fits the training data. We also need to ensure the trained network generalizes well to unseen data.

Therefore, we need to make predictions on the test set and evaluate the generalization ability.

We can feed the entire test set to the network at once, or we can do it in batches. Working in batches is usually better: with a very big dataset, feeding the entire test set to the network may exceed the available memory of your computer.

To do that in batches, we can define a `test()` function, which is similar to the `train()` function but much simpler.

In [None]:
# define a function to make predictions on test dataset and evaluate the performance
def test(dataloader, model, loss_fn):
    batch_pred_list = [] # for recording batch predictions
    model.eval() # set the model in evaluation mode
    with torch.no_grad(): # disable automatic gradient computing
        loss = 0.0 # set initial test loss to zero
        for (X, y) in dataloader: # get a batch from test samples
            batch_pred = model(X).squeeze(-1) # make predictions, squeeze(-1) drops the last dimension so `batch_pred` is 1D
            batch_loss = loss_fn(batch_pred, y) # compute current batch loss
            loss += batch_loss.item() # accumulate batch losses to compute the test loss
            batch_pred_list.append(batch_pred) # record predictions on current batch
        loss /= len(dataloader) # compute test loss
        pred = torch.cat(batch_pred_list).numpy() # join the batch predictions into a numpy 1D array
        print(f"test loss = {loss}") # print log
    return pred, loss

In [None]:
# make prediction on test set and evaluate the performance
test_pred, test_loss = test(test_dl, model, loss_fn)

In [None]:
# plot the prediction results of the test dataset
plt.figure()
plt.plot(test_label, test_pred, '.')
plt.plot([min(test_label), max(test_label)], [min(test_label), max(test_label)], '-')
plt.xlabel('target value')
plt.ylabel('predicted value')
plt.show()
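One detail worth knowing about the `test()` function above: it divides the summed batch losses by the number of batches. When the last batch is smaller than the others, this average differs slightly from the mean squared error over all samples. The sketch below uses made-up predictions and targets to compare the two, and shows that weighting each batch loss by its batch size recovers the full-dataset value.

```python
import torch
from torch import nn

loss_fn = nn.MSELoss()
pred = torch.tensor([1.0, 2.0, 3.0, 4.0, 10.0])   # made-up predictions
target = torch.zeros(5)                            # made-up targets

# split into batches of 2, 2 and 1 samples
batches = list(zip(pred.split(2), target.split(2)))

# plain average over batches, as in test()
mean_of_batch_losses = sum(loss_fn(p, t).item() for p, t in batches) / len(batches)
# average weighted by the number of samples in each batch
weighted = sum(loss_fn(p, t).item() * len(t) for p, t in batches) / len(target)
# loss computed on all samples at once
full = loss_fn(pred, target).item()

print(mean_of_batch_losses, weighted, full)
```

With equally sized batches the two averages coincide, so the difference only matters when the last batch is much smaller than the rest.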

## Part 2. Implement a multi-layer ANN for regression

We have already implemented our first one-layer ANN as a linear regression model.
- The network has only an input layer and an output layer
- The number of input neurons in the input layer is equal to the number of input features
- The number of output neurons is 1, which is equal to the number of target variables to predict
- No activation function is specified, so the output neuron effectively uses the linear activation function: $f(z)=z$.

Therefore, the output of this one-layer ANN is $\hat{y}=w_1x_1+w_2x_2+...+w_nx_n+b$, the same as a linear regression model.

However, we already know that such a model might not be complex enough to learn the hidden patterns in the dataset and to solve complex regression problems.

### Option 1. Increase the capacity of the ANN

In the context of ANNs, the capacity of a one-layer ANN might be too small for a complex regression problem. We can increase this capacity by constructing a multi-layer ANN, which stacks multiple layers in sequence to form a deeper ANN with larger capacity.

To do that, keep the following in mind:
- The input size of the first layer is equal to the number of features
- The input size of a layer is equal to the output size of the previous layer
- If we have multiple layers stacked together, we can create an ordered container of layers using `torch.nn.Sequential`.

In [None]:
# define a custom neural network class
class NeuralNetwork(nn.Module):
    def __init__(self, n_features, n_labels):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 16),
            nn.Linear(16, 8),
            nn.Linear(8, 4),
            nn.Linear(4, n_labels)
        )
    def forward(self, X):
        return self.net(X)

We defined a 4-layer network with linear activation functions.

The `forward()` function remains the same. This is because we embedded multiple `nn.Linear()` layers into one `nn.Sequential()` container, which can be used as a single layer in forward propagation.

Read the `PyTorch` documentation:
- `torch.nn.Sequential()` at this [link](https://pytorch.org/docs/stable/generated/torch.nn.Sequential.html#torch.nn.Sequential).

This is how we can increase the capacity of the ANN by adding more layers in the network.
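A rough way to compare the size of the two networks is to count their learnable parameters. The helper `count_parameters` below is a hypothetical name introduced for illustration, and treating the parameter count as a proxy for capacity is a simplifying assumption.

```python
from torch import nn

def count_parameters(module):
    # total number of learnable values (weights and biases)
    return sum(p.numel() for p in module.parameters())

n_features, n_labels = 10, 1   # the diabetes dataset has 10 features

one_layer = nn.Linear(n_features, n_labels)
multi_layer = nn.Sequential(
    nn.Linear(n_features, 16),
    nn.Linear(16, 8),
    nn.Linear(8, 4),
    nn.Linear(4, n_labels),
)

print(count_parameters(one_layer))    # 11
print(count_parameters(multi_layer))  # 353
```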

<span style="color:red">Now use this multi-layer ANN to build a regression model using the same dataset and steps in Part 1.</span>
- Can you improve the model's performance compared with the one-layer ANN?
- Try to change the batch size, learning rate, number of layers and neurons to achieve a good performance without increasing the capacity too much.
- If you get a loss value of `nan`, the loss has most likely grown beyond the range that a `float32` value can represent.
    - This is a symptom of what we call gradient explosion
    - You can try to decrease the learning rate
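The way a `nan` arises from overflow can be reproduced in isolation. The numbers below are made up to sit near the `float32` limit. In a training loop, a check such as `torch.isfinite(batch_loss)` is one possible way to detect the problem early.

```python
import torch

big = torch.tensor(3e38, dtype=torch.float32)  # close to the float32 maximum (about 3.4e38)
overflow = big * 10                  # exceeds the float32 range and becomes inf
not_a_number = overflow - overflow   # inf - inf is undefined and becomes nan

print(overflow)                              # tensor(inf)
print(not_a_number)                          # tensor(nan)
print(torch.isfinite(not_a_number).item())   # False
```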

### Option 2. Add non-linearity in ANN

No matter how many hidden layers and neurons we add to the network, as long as the activations are linear the output is still a linear combination of the input features, which can only represent a linear relationship between the feature variables and the target variable.
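This can be verified numerically: two stacked linear layers compute $W_2(W_1x+b_1)+b_2=(W_2W_1)x+(W_2b_1+b_2)$, which is again a single linear map. The sketch below uses made-up layer sizes and random inputs.

```python
import torch
from torch import nn

torch.manual_seed(0)
stacked = nn.Sequential(nn.Linear(3, 4), nn.Linear(4, 1))  # made-up sizes
first, second = stacked[0], stacked[1]

with torch.no_grad():
    W = second.weight @ first.weight               # combined weight, shape (1, 3)
    b = second.weight @ first.bias + second.bias   # combined bias, shape (1,)
    X = torch.randn(6, 3)                          # 6 made-up samples
    out_stacked = stacked(X)                       # output of the two-layer network
    out_single = X @ W.T + b                       # output of the equivalent single layer

print(torch.allclose(out_stacked, out_single, atol=1e-5))
```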

To add non-linearity to the network, we can change the activation function from a linear one to a non-linear one. To do that:
- We can add the activation function as a separate layer following the corresponding layer, for example, a `torch.nn.ReLU()` following a `torch.nn.Linear()`
- Or we can apply the activation function inside the `forward()` function; the networks in this notebook do not use this approach.
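For reference, the second approach could look like the sketch below. The class name `FunctionalNetwork` and the hidden size of 8 are hypothetical choices for illustration; `torch.nn.functional.relu` is the functional counterpart of `torch.nn.ReLU()`.

```python
import torch
from torch import nn
import torch.nn.functional as F

class FunctionalNetwork(nn.Module):  # hypothetical name
    def __init__(self, n_features, n_labels):
        super().__init__()
        self.hidden = nn.Linear(n_features, 8)  # hidden size chosen arbitrarily
        self.out = nn.Linear(8, n_labels)
    def forward(self, X):
        hidden = F.relu(self.hidden(X))  # activation applied inside forward()
        return self.out(hidden)          # no activation on the output layer

net = FunctionalNetwork(n_features=10, n_labels=1)
print(net(torch.zeros(2, 10)).shape)  # torch.Size([2, 1])
```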

The most popular non-linear activation functions are:
- ReLU, `torch.nn.ReLU()`, see the [documentation](https://pytorch.org/docs/stable/generated/torch.nn.ReLU.html#torch.nn.ReLU).
- Tanh, `torch.nn.Tanh()`, see the [documentation](https://pytorch.org/docs/stable/generated/torch.nn.Tanh.html#torch.nn.Tanh).
- Sigmoid, `torch.nn.Sigmoid()`, see the [documentation](https://pytorch.org/docs/stable/generated/torch.nn.Sigmoid.html#torch.nn.Sigmoid).
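To get a feel for these three functions, we can apply them to a few made-up values. ReLU clips negative inputs to zero, Tanh maps inputs into the range (-1, 1), and Sigmoid maps inputs into the range (0, 1).

```python
import torch
from torch import nn

z = torch.tensor([-2.0, 0.0, 2.0])   # made-up pre-activation values

relu_out = nn.ReLU()(z)
tanh_out = nn.Tanh()(z)
sigmoid_out = nn.Sigmoid()(z)

print(relu_out)     # tensor([0., 0., 2.])
print(tanh_out)     # values between -1 and 1, zero at z = 0
print(sigmoid_out)  # values between 0 and 1, 0.5 at z = 0
```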

You can find all the pre-defined types of layers and activation function in PyTorch at this [link](https://pytorch.org/docs/stable/nn.html#non-linear-activations-weighted-sum-nonlinearity).

These pre-defined layers and activation functions give us great flexibility in designing the structure of the ANN.

Now let's introduce non-linearity to the ANN by adding non-linear activation function layers.

In [None]:
# define a custom neural network class
class NeuralNetwork(nn.Module):
    def __init__(self, n_features, n_labels):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 16),
            nn.ReLU(),
            nn.Linear(16, 8),
            nn.ReLU(),
            nn.Linear(8, 4),
            nn.ReLU(),
            nn.Linear(4, n_labels)
        )
    def forward(self, X):
        return self.net(X)

We add the ReLU activation function after each of the three hidden layers in the network.

Note that we usually don't add a non-linear activation function to the output layer of a regression network.

<span style="color:red">Now use this non-linear ANN to build a regression model using the same dataset and steps in Part 1.</span>
- Can you improve the model's performance compared with the linear ANN?
- Try to change the activation functions and other hyper-parameters to achieve a good performance.
    - If your network always predicts the average value of the target variable, the capacity of the network may not be enough to capture the non-linear relationship. Try to increase the capacity.
    - If the train loss is much lower than the test loss, your network is probably over-fitted. Try to decrease the capacity.

The training process of deep learning is often compared to alchemy because there are so many hyperparameters that we can adjust.

We will learn more techniques in later lessons to keep our models from overfitting or underfitting. But experience is still extremely important.

Try as many different combinations of training parameters as possible. You gain experience by observing the problems you encounter and solving them by adjusting the parameters.