> **Topic:** Introduction to Neural Network using PyTorch
>
> **Module:** Building the neural network model
>
> **Presenter:** Industry Sandbox and AI Computing (ISAIC)
>
> **Date:** 

## Neural Network 101

- **The Problem**

In this example, we will explore a classification problem: we have a dataset of images, each of which belongs to one of 10 classes. The objective is to build a machine learning model that correctly identifies the class of a given image.

In general, let us denote the entire dataset as $X = \{x_1, x_2, ..., x_N\}$, containing $N$ samples, and let $Y = \{y_1, y_2, ..., y_N\}$ be the set of labels denoting the class of each sample. In our example, the values of $y_i$ belong to the discrete set $\{0,1,2,3,4,5,6,7,8,9\}$, as there are only 10 classes.

Our goal is to build a neural network function denoted as $f_{\theta}$, which is parametrized by a set of parameters $\theta$ that correctly maps each input sample $x_i$ to its class $y_i$:

$$f_{\theta}(x_i) = \hat{y}_i,$$


where $\hat{y}_i$ is the class predicted by the model. The challenge is to find the parameters $\theta$ for which the predictions $\hat{y}_i$ match the true labels $y_i$ as closely as possible. To do this, we define a **loss function**, denoted $\mathcal{L}$, that measures how far the model's predictions are from the correct class labels; a lower value means a better fit. During training, the optimizer changes the parameters in small steps, observes how the value of the loss function changes, and searches for the parameter values that minimize $\mathcal{L}$.
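The idea of changing the parameters in small steps can be sketched with plain gradient descent on a toy problem. The one-parameter quadratic loss, the learning rate, and the number of steps below are illustrative assumptions and are not part of the image classification example.

```python
import torch

# Toy loss L(theta) = (theta - 3)^2 with its minimum at theta = 3 (illustrative choice)
theta = torch.tensor(0.0, requires_grad=True)
learning_rate = 0.1  # assumed step size

for step in range(100):
    loss = (theta - 3.0) ** 2
    loss.backward()                       # compute dL/dtheta
    with torch.no_grad():
        theta -= learning_rate * theta.grad  # small step against the gradient
    theta.grad.zero_()                    # reset the gradient for the next step

final_theta = theta.item()
print(final_theta)  # should end up very close to 3
```

In practice the update step is usually delegated to an optimizer from `torch.optim` rather than written by hand.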

- **Structure of Basic Deep Neural Network Model**

The basic building block of a neural network is the *perceptron*. Given a D-dimensional input vector $x \in \mathbb{R}^D$, the perceptron is defined as,

$$P = \sigma(x^Tw+b),$$
where,
- $\{w,b\}$ are the parameters (weights and bias) and $x^Tw+b$ is an affine transformation, i.e. a linear mapping plus a constant offset.
- $\sigma$ is a non-linear function, called **activation function**.
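As a minimal sketch, the perceptron equation above can be evaluated directly with tensors. The input dimension, the random values, and the choice of sigmoid for $\sigma$ are assumptions made only for illustration.

```python
import torch

torch.manual_seed(0)

D = 4                     # input dimension (illustrative)
x = torch.randn(D)        # one input sample x in R^D
w = torch.randn(D)        # weight vector
b = torch.randn(())       # scalar bias

affine = x @ w + b        # affine transformation x^T w + b
P = torch.sigmoid(affine) # sigma is taken to be the sigmoid here (assumption)

print(affine.item(), P.item())
```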

In our example, we will build the basic neural network model by applying the perceptron operation multiple times, each time to the output of the previous layer. This is why the simplest NN model is also called a *multilayer perceptron (MLP)*. The following diagram visualizes the structure of an MLP model.

<img src=https://1.cms.s81c.com/sites/default/files/2021-01-06/ICLH_Diagram_Batch_01_03-DeepNeuralNetwork-WHITEBG.png alt="Drawing" style="width: 500px;"/>

*Image credit* : https://1.cms.s81c.com/sites/default/files/2021-01-06/ICLH_Diagram_Batch_01_03-DeepNeuralNetwork-WHITEBG.png


The first input is our D-dimensional input data vector, called the *input layer*. The last output for our example should be a 10-dimensional vector (representing the classes) and is called the *output layer*. All the intermediate inputs/outputs of the perceptron operations are called *hidden layers*. In general, we can write the perceptron operation for any layer $L+1$ using the output from layer $L$,

$$X_{L+1} = \sigma(X^T_{L}w_L+b_L).$$

The weights and biases $\{w_j,b_j\}$, where $j \in \{0,1,...,L\}$, from all the layers form the parameter set $\theta$ for our model $f_{\theta}$.

Note that the dimensions of the hidden layers can differ from the input and output layer dimensions; they are a choice the user makes when building the model.
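The layer-by-layer computation can be sketched with explicit weight matrices. The dimensions (784 inputs, one hidden layer of width 64, 10 outputs), the random initialization, and the variable names are assumptions for illustration only.

```python
import torch

torch.manual_seed(0)

D, H, C = 784, 64, 10     # input, hidden, and output dimensions (illustrative)
x = torch.randn(D)        # one input sample

# One (w, b) pair per layer; together they form the parameter set theta
w0, b0 = torch.randn(D, H) * 0.01, torch.zeros(H)
w1, b1 = torch.randn(H, C) * 0.01, torch.zeros(C)

hidden = torch.relu(x @ w0 + b0)                  # hidden layer, shape (H,)
output = torch.log_softmax(hidden @ w1 + b1, 0)   # output layer, shape (C,)

# Total number of scalar parameters in theta
n_params = sum(t.numel() for t in (w0, b0, w1, b1))
print(hidden.shape, output.shape, n_params)
```

With these sizes, the parameter count is $784 \cdot 64 + 64 + 64 \cdot 10 + 10 = 50890$.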

- **Choice of Activation Function**

Several non-linear activation functions are used in artificial neural networks. Some examples are:

  - `tanh` (Hyperbolic Tangent)
  - `sigmoid` (also called the logistic function)
  - `ReLU` (Rectified Linear Unit)
  - `Softmax`
  - `LogSoftmax` (Used for classification)

For our classification problem, we will use the `ReLU` activation for all the hidden layers and `LogSoftmax` for the final output layer. For our 10-class output layer, the `LogSoftmax` output is the logarithm of the probability that the sample belongs to each class; exponentiating it recovers the probabilities, which sum to 1.
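A small sketch of how these two activations behave on a single sample; the three raw scores are made-up values chosen for illustration.

```python
import torch
import torch.nn as nn

# One sample with three raw scores (illustrative values)
z = torch.tensor([[-2.0, 0.0, 3.0]])

relu_out = nn.ReLU()(z)               # negative entries are set to zero
log_probs = nn.LogSoftmax(dim=1)(z)   # log-probabilities over the class dimension
probs = log_probs.exp()               # exponentiate to recover probabilities

print(relu_out)
print(log_probs)
print(probs.sum(dim=1))               # probabilities sum to 1 for each sample
```

Note that the log-probabilities are all non-positive, and the predicted class is the index with the largest value.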

- **Choice of Loss Function**

The choice of loss function also depends on the problem we are solving. For example, if the classification problem contains only two classes, we use a loss function called `binary cross entropy`. If the problem involves more than two classes (as in our 10-class problem), we use the `negative log-likelihood (nll)` loss, which is the negative of the log-probability that the output layer (i.e. the output of the `LogSoftmax` function) assigns to the correct class, typically averaged over the samples in a batch.
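The following sketch checks this definition numerically with `nn.NLLLoss`. The batch of random scores and the target labels are made up for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

logits = torch.randn(4, 10)            # raw scores: 4 samples, 10 classes (illustrative)
targets = torch.tensor([3, 0, 9, 1])   # made-up class labels

log_probs = nn.LogSoftmax(dim=1)(logits)
loss = nn.NLLLoss()(log_probs, targets)

# Same quantity by hand: pick the log-probability of the true class,
# negate it, and average over the batch
manual = -log_probs[torch.arange(4), targets].mean()

# nn.CrossEntropyLoss combines LogSoftmax and NLLLoss in one step
ce = nn.CrossEntropyLoss()(logits, targets)

print(loss.item(), manual.item(), ce.item())
```

All three values should agree up to floating-point error.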

## Building the model with `torch.nn`

The `torch.nn` library contains all the tools we need to build a neural network model.

In particular, `nn.Module` is the base class for all neural network modules. Our model class inherits from `nn.Module` and defines two standard methods: `__init__`, where we define and initialize the model parameters, and `forward`, where we compute the output of the model for a given sample. Classes derived from `nn.Module` can also contain other modules, so they can be nested in a tree structure to build more complicated models. To build our model, we define each step (e.g. each layer with its parameters and operations) in sequence and stack the steps together to get the final model output. Each of these steps is referred to as a *submodule*.

The following `torch.nn` classes are used to build our model:

- `nn.ModuleList` is a container that holds multiple submodules in a list, so we can apply each step in turn to a given input.

- `nn.Flatten` flattens each input sample (here, a 2D image) into a one-dimensional vector.

- `nn.Linear` defines one layer of affine transformation (explained above).

- `nn.ReLU` applies the ReLU activation function to its input.

- `nn.LogSoftmax` applies the LogSoftmax function to its input.

We will also use two regularization methods, batch normalization and dropout, to reduce overfitting and stabilize the optimization of our model. `nn.BatchNorm1d` and `nn.Dropout` implement these two methods.
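As a rough sketch of how these pieces fit together, the snippet below stores a few submodules in an `nn.ModuleList` and applies them in order to a dummy batch. The 28x28 image size, the layer widths, the dropout rate, and the helper `run` are assumptions made for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Submodules stored in order (sizes are illustrative assumptions)
layers = nn.ModuleList([
    nn.Flatten(),
    nn.Linear(28 * 28, 16),
    nn.BatchNorm1d(16),
    nn.ReLU(),
    nn.Dropout(0.5),
    nn.Linear(16, 10),
    nn.LogSoftmax(dim=1),
])

def run(x):
    # hypothetical helper: apply each submodule in turn
    for layer in layers:
        x = layer(x)
    return x

batch = torch.randn(8, 28, 28)   # 8 dummy "images"

with torch.no_grad():
    layers.train()               # dropout active, batch statistics used
    train_a, train_b = run(batch), run(batch)
    layers.eval()                # dropout disabled, running statistics used
    eval_a, eval_b = run(batch), run(batch)

print(eval_a.shape)              # torch.Size([8, 10])
print(torch.equal(train_a, train_b), torch.equal(eval_a, eval_b))
```

In training mode the two passes generally differ because dropout draws a new random mask each time, while in evaluation mode the output is deterministic. This is why a model should be switched to `eval()` before making predictions.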

Now that we have introduced the basic concepts of neural networks and the PyTorch tools, let's build our first model.




```python
# load necessary modules
import torch.nn as nn
import torch.nn.functional as F

# define the model class
class ImageClass(nn.Module):
    '''
    Main class that defines the neural network model (inherits from nn.Module)
    Input - input_dim (int): Input dimension of each sample after flattening
            hidden_dim (list or tuple of int): Each element denotes the dimension
                            of one hidden layer
            output_dim (int): Output dimension (in our example case, 10 for
                            total 10 classes of image)
            dropout_rate (float) [optional]: Rate at which Dropout regularization is applied
            use_batchnorm (bool) [optional]: Whether to use batch normalization after each
                                            hidden layer affine transformation
            **kwargs: Additional keyword arguments passed to nn.Module
    '''

    def __init__(self, input_dim, hidden_dim=(64, 64, 64), output_dim=10,
                 dropout_rate=0.0, use_batchnorm=False, **kwargs):
        super().__init__(**kwargs)

        # copy into a list so that lists and tuples both work
        # (this also avoids a mutable default argument)
        hidden_dim = list(hidden_dim)

        # define an empty ModuleList container
        self.linear_model = nn.ModuleList()

        # first flatten our 2D image data into one dimension
        self.linear_model.append(nn.Flatten())

        # then we build our hidden layers iteratively (based on the # hidden layers)
        for in_channel, out_channel in zip([input_dim] + hidden_dim[:-1], hidden_dim):
            # we first build the affine transformation
            self.linear_model.append(nn.Linear(in_channel, out_channel, bias=True))
            if use_batchnorm:
                # then apply batch normalization, if turned on for the model
                self.linear_model.append(nn.BatchNorm1d(out_channel))
            # we then apply the activation function
            self.linear_model.append(nn.ReLU())
            if dropout_rate:
                # we also add dropout if this regularization is turned on
                self.linear_model.append(nn.Dropout(dropout_rate))
        # add the last layer, i.e. the model output (log-probabilities per class)
        self.linear_model.append(nn.Linear(hidden_dim[-1], output_dim, bias=True))
        self.linear_model.append(nn.LogSoftmax(dim=1))

    def forward(self, x):
        # apply each submodule in sequence
        for layer in self.linear_model:
            x = layer(x)
        return x
```

## Bonus: Dive into Activation Functions, Loss Functions, Regularizations