<a href="https://www.kaggle.com/code/aisuko/building-a-neural-networks?scriptVersionId=164240370" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Overview

A neural network is a collection of neurons organized in layers. Each neuron is a small computing unit that performs simple calculations, and together the neurons solve a problem. There are 3 types of layers:

* input layer
* hidden layer
* output layer

The hidden and output layers each contain a number of neurons, while the input layer only holds the input features. Neural networks are loosely modeled on the way a human brain processes information.


# The components of a neural network


# Common Types of Activation Functions

***An activation function determines whether a neuron should be activated or not***. The computations that happen in a neural network include applying an activation function. If a neuron activates, it means the input is important. There are different kinds of activation functions. **The choice of which activation function to use depends on what you want the output to be. Another important role of an activation function is to add non-linearity to the model**.


## Binary 

The binary step function sets an output node to 1 if the input is zero or positive, and to 0 if the input is negative.

$$f(x)= {\small \begin{cases} 0, & \text{if } x < 0\\ 1, & \text{if } x\geq 0\\ \end{cases}}$$

## Sigmoid

It maps the input to a value between 0 and 1, so the output of a node can be read as a probability.

$$f(x) = {\large \frac{1}{1+e^{-x}}}$$

## Tanh

It maps the input to a value between -1 and 1. Used in classification use cases.

$$f(x) = {\large \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}}$$

## ReLU (Rectified Linear Unit)

It sets the output node to 0 if the input is negative and keeps the value unchanged if the input is zero or positive.

$$f(x)= {\small \begin{cases} 0, & \text{if } x < 0\\ x, & \text{if } x\geq 0\\ \end{cases}}$$

where $f(x)$ is the activation function.
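The four formulas above can be written out directly. The following is a minimal sketch in plain Python, assuming scalar inputs; the function names (`binary_step`, `sigmoid`, `tanh`, `relu`) are chosen here for illustration and are not part of PyTorch.

```python
import math


def binary_step(x: float) -> float:
    """Return 1 for zero or positive input, 0 for negative input."""
    return 1.0 if x >= 0 else 0.0


def sigmoid(x: float) -> float:
    """Map x to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x: float) -> float:
    """Map x to the open interval (-1, 1)."""
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))


def relu(x: float) -> float:
    """Keep zero or positive input unchanged, replace negative input with 0."""
    return x if x >= 0 else 0.0


# Compare the four functions on a negative, a zero, and a positive input
for value in (-2.0, 0.0, 2.0):
    print(value, binary_step(value), sigmoid(value), tanh(value), relu(value))
```

In practice we would use the PyTorch equivalents (`torch.sigmoid`, `torch.tanh`, `torch.relu`), which apply the same functions element-wise to tensors.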


## Weights

`Weights` influence how close the output of our network comes to the expected output value. As an input enters the neuron, it gets multiplied by a weight value, and the resulting output is either observed or passed to the next layer in the neural network. The weights for all neurons in a layer are organized into one tensor.


## Bias

`Bias` is a learnable constant that is added to the weighted sum of a neuron's inputs. It shifts the value that is passed to the activation function, so a neuron can produce a non-zero output even when all of its inputs are zero.

We can say that the output $x$ of a neural network layer with weights $W$ and bias $b$ is computed as the sum of the inputs multiplied by the weights, plus the bias. The activation function $f(x)$ is then applied to this value.

$$x = \sum{(weights * inputs)} + bias$$
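To make the weighted sum concrete, here is a small worked example for a single neuron with three inputs. The input, weight, and bias values are made up for illustration only.

```python
# Hypothetical values for one neuron with three inputs
inputs = [0.5, -1.0, 2.0]
weights = [0.2, 0.4, -0.1]
bias = 0.3

# Multiply each input by its weight, add the products, then add the bias:
# (0.2 * 0.5) + (0.4 * -1.0) + (-0.1 * 2.0) + 0.3, which is about -0.2
pre_activation = sum(w * i for w, i in zip(weights, inputs)) + bias

# Apply ReLU as the activation function: negative values become 0
activated = max(0.0, pre_activation)

print(pre_activation)
print(activated)
```

Because the weighted sum is negative here, ReLU sets this neuron's output to 0.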

# Building a neural network

Neural networks are composed of layers/modules that perform operations on data. The `torch.nn` namespace provides all the building blocks you need to build your own neural network. Every module in PyTorch subclasses `nn.Module`. A neural network is a module itself that consists of other modules (layers). This nested structure allows for building and managing complex architectures easily. In the following sections, we will build a neural network to classify images in the FashionMNIST dataset.

In [1]:
%matplotlib inline
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

In [2]:
import os
import torch
import warnings


if torch.cuda.is_available():
    torch_device = 'cuda'
else:
    torch_device = 'cpu'

warnings.filterwarnings('ignore')

print(torch_device)

cuda


# Define the class

We define our neural network by subclassing `nn.Module`, and initialize the neural network layers in `__init__`. Every `nn.Module` subclass implements the operations on input data in the `forward` method.

Our neural network is composed of the following:

* The input layer with 28x28 or 784 features/pixels
* The first linear module takes the 784 input features and transforms them to a hidden layer with 512 features
* The ReLU activation function will be applied in the transformation
* The second linear module takes 512 features as input from the first hidden layer and transforms them to the next hidden layer with 512 features
* The ReLU activation function will be applied in the transformation
* The third linear module takes 512 features as input from the second hidden layer and transforms them to the output layer with 10 features, which is the number of classes
* The ReLU activation function will be applied in the transformation


In [3]:
class NeuralNetwork(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        # Flatten each 28x28 image into a vector of 784 values
        self.flatten = nn.Flatten()
        # Three linear layers, each followed by a ReLU activation
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 10),
            nn.ReLU()
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits

Move the Neural Network to the device we defined earlier and print the model's architecture.

In [4]:
model = NeuralNetwork().to(torch_device)
print(model)

NeuralNetwork(
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (linear_relu_stack): Sequential(
    (0): Linear(in_features=784, out_features=512, bias=True)
    (1): ReLU()
    (2): Linear(in_features=512, out_features=512, bias=True)
    (3): ReLU()
    (4): Linear(in_features=512, out_features=10, bias=True)
    (5): ReLU()
  )
)


To use the model, we pass it the input data. This executes the model's `forward` method, along with some background operations. However, do not call `model.forward()` directly. Calling the model on the input returns a 10-dimensional tensor with the raw predicted values for each class. We get the prediction probabilities by passing it through an instance of `nn.Softmax`.

In [5]:
X = torch.rand(1, 28, 28, device=torch_device)
logits = model(X)
pred_probab = nn.Softmax(dim=1)(logits)
y_pred = pred_probab.argmax(1)
print(f"Predicted class: {y_pred}")

Predicted class: tensor([1], device='cuda:0')
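One way to see why we call the model instead of `forward` is to register a forward hook, which is one of the background operations that only run when the module itself is called. The sketch below uses a small stand-in `nn.Linear` layer rather than the full model, and the `calls` list is a name introduced here for illustration.

```python
import torch
from torch import nn

# Small stand-in module; the same behaviour applies to any nn.Module
layer = nn.Linear(4, 2)
calls = []

# The hook records every time it is triggered
handle = layer.register_forward_hook(
    lambda module, inputs, output: calls.append("hook")
)

x = torch.rand(1, 4)
layer(x)          # goes through nn.Module.__call__, so the hook runs
layer.forward(x)  # bypasses __call__, so the hook is skipped

print(calls)
handle.remove()
```

Only one entry is recorded, because the direct `forward` call skipped the hook.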


# Weight and Bias

The `nn.Linear` module randomly initializes the *weights* and *bias* for each layer and internally stores the values in tensors.

In [6]:
print(f'First Linear weights: {model.linear_relu_stack[0].weight} \n')
print(f'First Linear bias: {model.linear_relu_stack[0].bias} \n')

First Linear weights: Parameter containing:
tensor([[ 0.0272, -0.0109, -0.0138,  ...,  0.0176, -0.0099, -0.0291],
        [-0.0113, -0.0087,  0.0291,  ..., -0.0325,  0.0020,  0.0321],
        [-0.0207,  0.0073,  0.0327,  ...,  0.0228, -0.0054,  0.0249],
        ...,
        [-0.0233,  0.0311,  0.0005,  ..., -0.0251,  0.0259,  0.0288],
        [ 0.0352, -0.0249,  0.0155,  ...,  0.0191, -0.0153, -0.0054],
        [ 0.0229, -0.0135,  0.0256,  ...,  0.0344, -0.0230,  0.0100]],
       device='cuda:0', requires_grad=True) 

First Linear bias: Parameter containing:
tensor([-2.4048e-02, -2.0411e-02,  1.6948e-02, -2.8014e-02,  2.6954e-02,
         1.6698e-03, -3.4415e-02, -3.9478e-03, -1.8268e-02, -1.4037e-02,
        -1.8889e-02,  2.0402e-02, -3.0342e-02, -2.5653e-02, -6.1274e-03,
         1.5839e-02,  5.6672e-03, -3.8207e-03, -7.0029e-04,  9.1348e-03,
         1.1926e-02, -1.8516e-02, -3.1290e-02,  1.7284e-02, -3.0562e-02,
        -1.1724e-02, -5.2576e-03, -2.4647e-02,  4.2638e-03, -3.5130e-0
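The printed values are all small. As far as the PyTorch documentation for `nn.Linear` describes it, the default initialization draws weights and biases uniformly from $[-\sqrt{k}, \sqrt{k}]$ with $k = 1/\text{in\_features}$; this default could differ in other versions, so treat the check below as a sketch under that assumption. For 784 input features the bound is $1/28 \approx 0.0357$.

```python
import math

import torch
from torch import nn

layer = nn.Linear(in_features=28 * 28, out_features=512)

# Assumed default bound: 1 / sqrt(in_features) = 1 / 28
bound = 1 / math.sqrt(layer.in_features)

print(layer.weight.shape)  # one row of 784 weights per output neuron
print(layer.bias.shape)    # one bias per output neuron
print(bound)

# Small tolerance added for float32 rounding
print(layer.weight.abs().max().item() <= bound + 1e-6)
print(layer.bias.abs().max().item() <= bound + 1e-6)
```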

# Model layers

Let's break down the layers in the FashionMNIST model. 

To illustrate it, we will take a sample minibatch of $3$ images of size $28 \times 28$ and see what happens to it as we pass it through the network.

In [7]:
input_sample = torch.rand(3,28,28)
print(input_sample.size())

torch.Size([3, 28, 28])


# nn.Flatten

We initialize the `nn.Flatten` layer to convert each 2D $28 \times 28$ image into a contiguous array of $784$ pixel values (the minibatch dimension at `dim=0` is maintained). Each pixel is passed to the input layer of the neural network.

In [8]:
flatten = nn.Flatten()
flat_image = flatten(input_sample)
print(flat_image.size())

torch.Size([3, 784])
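With its default arguments, `nn.Flatten` gives the same result as reshaping the tensor while keeping the batch dimension. The sketch below checks this on a random batch; the variable names are introduced here for illustration.

```python
import torch
from torch import nn

batch = torch.rand(3, 28, 28)

flat_module = nn.Flatten()(batch)
flat_reshape = batch.reshape(batch.size(0), -1)  # keep dim 0, merge the rest

print(flat_module.shape)
print(torch.equal(flat_module, flat_reshape))

# Rows are laid out one after another, so the pixel at (row 5, col 7)
# of the first image ends up at index 5 * 28 + 7
print(torch.equal(flat_module[0, 5 * 28 + 7], batch[0, 5, 7]))
```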


# nn.Linear

The linear layer is a module that applies a linear transformation on the input using its stored weights and biases. The grayscale value of each pixel in the input layer will be connected to neurons in the hidden layer for calculation. The calculation used for the transformation is $weight * input + bias$.

In [9]:
layer1 = nn.Linear(in_features=28*28, out_features=20)
hidden1 = layer1(flat_image)
print(hidden1.size())

torch.Size([3, 20])
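For a whole batch, the $weight * input + bias$ calculation is a matrix multiplication. `nn.Linear` stores its weight with shape `(out_features, in_features)`, so the manual version multiplies by the transposed weight. The following sketch compares the two on random data; `manual` is a name introduced here.

```python
import torch
from torch import nn

torch.manual_seed(0)
layer = nn.Linear(in_features=28 * 28, out_features=20)
x = torch.rand(3, 28 * 28)

# Same computation written out by hand: x @ W^T + b
manual = x @ layer.weight.T + layer.bias

print(layer.weight.shape)  # (out_features, in_features)
print(manual.shape)
print(torch.allclose(layer(x), manual, atol=1e-5))
```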


# nn.ReLU

**Non-linear activations are what create the complex mappings between the model's input and output**. They are applied after linear transformations to introduce `non-linearity`, helping neural networks learn a wide variety of phenomena. In this model, we use `nn.ReLU` between our linear layers, but there are other activations that introduce non-linearity in your model.

The ReLU activation function takes the output from the linear layer calculation and replaces the negative values with zeros.

Linear output: ${ x = {weight * input + bias}}$


ReLU: $f(x)= {\small \begin{cases} 0, & \text{if } x < 0\\ x, & \text{if } x\geq 0\\ \end{cases}}$

In [10]:
print(f'Before ReLU: {hidden1}\n\n')
hidden1 = nn.ReLU()(hidden1)
print(f'After ReLU: {hidden1}')

Before ReLU: tensor([[-0.4502, -0.3366, -0.1308,  0.3422,  0.2167,  0.2061, -0.1522,  0.3993,
         -0.4458,  0.3211, -0.0753, -0.4514, -0.0242, -0.3593, -0.4815,  0.3419,
         -0.0016,  0.2304,  0.1037,  0.1159],
        [-0.2365, -0.1137, -0.1026,  0.5273,  0.3703,  0.3312,  0.2213,  0.3376,
         -0.3838,  0.5399,  0.2080, -0.1127,  0.1466, -0.4485,  0.1284,  0.3791,
         -0.0809, -0.0686, -0.1838, -0.1539],
        [-0.5465, -0.5077,  0.1567,  0.5351,  0.0803,  0.5871, -0.1357,  0.6539,
         -0.6601,  0.3047,  0.2592, -0.3402, -0.1322, -0.1271, -0.2722, -0.0710,
         -0.4564,  0.1677,  0.2156, -0.2166]], grad_fn=<AddmmBackward0>)


After ReLU: tensor([[0.0000, 0.0000, 0.0000, 0.3422, 0.2167, 0.2061, 0.0000, 0.3993, 0.0000,
         0.3211, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.3419, 0.0000, 0.2304,
         0.1037, 0.1159],
        [0.0000, 0.0000, 0.0000, 0.5273, 0.3703, 0.3312, 0.2213, 0.3376, 0.0000,
         0.5399, 0.2080, 0.0000, 0.1466, 0.0000, 0.12
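ReLU is available both as a module and as a function. The short sketch below, using a hand-picked tensor, shows that `nn.ReLU`, `torch.nn.functional.relu`, and clamping at zero all give the same result.

```python
import torch
from torch import nn
import torch.nn.functional as F

values = torch.tensor([-1.5, 0.0, 2.0])

as_module = nn.ReLU()(values)   # module form, convenient inside nn.Sequential
as_function = F.relu(values)    # functional form, convenient inside forward()
as_clamp = values.clamp(min=0)  # the same operation written as a clamp

print(as_module)
print(torch.equal(as_module, as_function))
print(torch.equal(as_module, as_clamp))
```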

# nn.Sequential

`nn.Sequential` is an ordered container of modules. The data is passed through all the modules in the same order as defined. You can use sequential containers to put together a quick network like `seq_modules`.

In [11]:
seq_modules = nn.Sequential(
    flatten,
    layer1, nn.ReLU(),
    nn.Linear(20,10)
)
input_image = torch.rand(3,28,28)
logits = seq_modules(input_image)
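Calling a sequential container is equivalent to calling its modules one after another, and the modules can be accessed by index. The sketch below rebuilds a network like `seq_modules` so that it runs on its own; `layer2` and `manual` are names introduced here for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)
flatten = nn.Flatten()
layer1 = nn.Linear(28 * 28, 20)
layer2 = nn.Linear(20, 10)
seq_modules = nn.Sequential(flatten, layer1, nn.ReLU(), layer2)

x = torch.rand(3, 28, 28)

# Chain the same modules by hand, in the same order
manual = layer2(torch.relu(layer1(flatten(x))))

print(seq_modules(x).shape)
print(torch.allclose(seq_modules(x), manual))
print(seq_modules[1] is layer1)  # the container holds the same layer object
```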

# nn.Softmax

The last linear layer of the neural network returns `logits`, raw values in $[-\infty, \infty]$, which are passed to the `nn.Softmax` module. The Softmax activation function is used to calculate the probability of each output of the neural network. **It is typically used only on the output layer of a neural network, and the results are scaled to values in [0, 1] that represent the model's predicted probabilities for each class**. The `dim` parameter indicates the dimension along which the values must sum to 1. The class with the highest probability is the model's prediction.

In [12]:
pred_probab = nn.Softmax(dim=1)(logits)
pred_probab

tensor([[0.1105, 0.0846, 0.1073, 0.1273, 0.0767, 0.1037, 0.0814, 0.1042, 0.0890,
         0.1153],
        [0.1120, 0.0978, 0.1089, 0.1269, 0.0775, 0.0906, 0.0869, 0.0925, 0.0914,
         0.1157],
        [0.1177, 0.0887, 0.1069, 0.1185, 0.0813, 0.0897, 0.0921, 0.0927, 0.0967,
         0.1158]], grad_fn=<SoftmaxBackward0>)
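Softmax exponentiates each logit and divides by the sum of the exponentials along `dim`. The sketch below checks this against `nn.Softmax` on hand-picked logits for two samples and three classes; the values are made up for illustration.

```python
import torch
from torch import nn

# Hypothetical logits: 2 samples, 3 classes
logits = torch.tensor([[2.0, 1.0, 0.1],
                       [0.5, 0.5, 3.0]])

probs = nn.Softmax(dim=1)(logits)

# Manual version: exp(logit) / sum of exp(logits) over the class dimension
exp_logits = torch.exp(logits)
manual = exp_logits / exp_logits.sum(dim=1, keepdim=True)

print(torch.allclose(probs, manual))
print(probs.sum(dim=1))     # each row sums to 1
print(probs.argmax(dim=1))  # index of the most probable class per sample
```

The predicted class is the index of the largest logit in each row, here class 0 for the first sample and class 2 for the second.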

# Model parameters

Many layers inside a neural network are `parameterized`, i.e. have associated weights and biases that are optimized during training. Subclassing `nn.Module` automatically tracks all fields defined inside your model object, and makes all parameters accessible using your model's `parameters()` or `named_parameters()` methods.

In [13]:
print('Model structure:', model, '\n\n')

for name, param in model.named_parameters():
    print(f'Layer: {name} | Size: {param.size()} | Values: {param[:2]} \n')

Model structure: NeuralNetwork(
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (linear_relu_stack): Sequential(
    (0): Linear(in_features=784, out_features=512, bias=True)
    (1): ReLU()
    (2): Linear(in_features=512, out_features=512, bias=True)
    (3): ReLU()
    (4): Linear(in_features=512, out_features=10, bias=True)
    (5): ReLU()
  )
) 


Layer: linear_relu_stack.0.weight | Size: torch.Size([512, 784]) | Values: tensor([[ 0.0272, -0.0109, -0.0138,  ...,  0.0176, -0.0099, -0.0291],
        [-0.0113, -0.0087,  0.0291,  ..., -0.0325,  0.0020,  0.0321]],
       device='cuda:0', grad_fn=<SliceBackward0>) 

Layer: linear_relu_stack.0.bias | Size: torch.Size([512]) | Values: tensor([-0.0240, -0.0204], device='cuda:0', grad_fn=<SliceBackward0>) 

Layer: linear_relu_stack.2.weight | Size: torch.Size([512, 512]) | Values: tensor([[ 0.0242,  0.0106, -0.0143,  ...,  0.0184,  0.0353,  0.0163],
        [-0.0338, -0.0241, -0.0397,  ..., -0.0380,  0.0086, -0.0161]],
       device='cuda:0
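The sizes reported by `named_parameters()` can be used to count the parameters of the model. The sketch below uses an `nn.Sequential` stand-in with the same layer sizes as `NeuralNetwork` so that it runs on its own, and compares the count with the value computed by hand (weights plus biases for each linear layer).

```python
import torch
from torch import nn

# Stand-in with the same layer sizes as the NeuralNetwork class
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, 10), nn.ReLU(),
)

# numel() returns the number of values stored in a tensor
total = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)

# Hand count: (inputs * outputs + outputs) for each linear layer
expected = (784 * 512 + 512) + (512 * 512 + 512) + (512 * 10 + 10)

print(total, trainable, expected)
```

`nn.Flatten` and `nn.ReLU` have no parameters, so only the three linear layers contribute to the total of 669,706.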