# Neural Networks

## Training Procedure
- Define the neural network that has some learnable parameters (or weights)
- Iterate over a dataset of inputs
- Process input through the network
- Compute the loss (how far the output is from being correct)
- Propagate gradients back into the network’s parameters
- Update the weights of the network, typically using a simple update rule: `weight = weight - learning_rate * gradient`
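The steps above can be sketched end to end on a toy problem. This is a minimal illustration, not the network used later in this notebook: the data (`y = 2x`), the single `nn.Linear` layer, the learning rate, and the number of iterations are all assumptions chosen to keep the example short.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy dataset (assumed for illustration): learn y = 2x
x = torch.linspace(-1, 1, 20).unsqueeze(1)   # shape (20, 1)
y = 2 * x

model = nn.Linear(1, 1)        # a network with learnable weight and bias
criterion = nn.MSELoss()
learning_rate = 0.1

losses = []
for step in range(100):
    pred = model(x)                # process input through the network
    loss = criterion(pred, y)      # compute the loss
    model.zero_grad()              # clear gradients left from the previous step
    loss.backward()                # propagate gradients back to the parameters
    with torch.no_grad():          # the update itself should not be tracked
        for p in model.parameters():
            p -= learning_rate * p.grad   # weight = weight - learning_rate * gradient
    losses.append(loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.6f}")
```

The loss should shrink as the loop runs; the rest of the notebook walks through the same steps one at a time on a convolutional network.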

In [10]:
import torch
import torch.nn as nn
import torch.nn.functional as F

## Convolution Process
- Each filter has 𝑛 slices, where each slice corresponds to an input channel.
- Perform element-wise multiplication between each filter slice and its corresponding input channel.
- Sum the results across all slices of the filter.
- The final sum is stored as one value in the output feature map.
- Repeat this process for every spatial position in the input.
- Each filter produces one output channel, so with m filters, we get m output channels.

- Input shape = (3,5,5) → 3 channels (RGB), 5×5 size
- Filter shape = (3,3,3) → 3 slices per filter, 3×3 kernel size.
- Number of filters = 4 → So, output has 4 channels.
- Output shape = (4,3,3) (assuming no padding and stride 1).
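The shapes in this example can be checked with `nn.Conv2d`. With no padding and stride 1, each spatial dimension becomes `input_size - kernel_size + 1 = 5 - 3 + 1 = 3`. The leading batch dimension of size 1 is an addition here, since PyTorch layers expect batched input.

```python
import torch
import torch.nn as nn

# 3 input channels, 4 filters, 3x3 kernel, no padding, stride 1
conv = nn.Conv2d(in_channels=3, out_channels=4, kernel_size=3)

x = torch.rand(1, 3, 5, 5)      # a batch holding one 3-channel 5x5 input
out = conv(x)

print(conv.weight.shape)        # torch.Size([4, 3, 3, 3]): 4 filters, each of shape (3, 3, 3)
print(out.shape)                # torch.Size([1, 4, 3, 3])
```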

### The convolution process involves:

1. Sliding the filter (kernel) over the input:
    - The filter moves across the input image in small steps (defined by stride).

2. Performing element-wise multiplication:
    - At each position, the corresponding values of the input and the filter are multiplied element-wise.

3. Summing the results:
    - The sum of all element-wise products gives one value in the output feature map.

4. Repeating this process for every position in the input:
    - This creates the full output feature map.
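The four steps can be written out as explicit loops and compared against PyTorch's built-in convolution. This is a deliberately slow sketch for illustration: `conv2d_naive` is a hypothetical helper name, and it assumes no padding and no bias.

```python
import torch
import torch.nn.functional as F

def conv2d_naive(x, weight, stride=1):
    """x: (C, H, W), weight: (M, C, K, K) -> (M, H_out, W_out). No padding, no bias."""
    C, H, W = x.shape
    M, _, K, _ = weight.shape
    h_out = (H - K) // stride + 1
    w_out = (W - K) // stride + 1
    out = torch.zeros(M, h_out, w_out)
    for m in range(M):                          # one output channel per filter
        for i in range(h_out):                  # slide the filter over the input
            for j in range(w_out):
                top, left = i * stride, j * stride
                patch = x[:, top:top + K, left:left + K]   # (C, K, K) window
                out[m, i, j] = (patch * weight[m]).sum()   # multiply element-wise, then sum
    return out

torch.manual_seed(0)
x = torch.rand(3, 5, 5)         # 3 channels, 5x5
w = torch.rand(4, 3, 3, 3)      # 4 filters, each (3, 3, 3)

manual = conv2d_naive(x, w)
reference = F.conv2d(x.unsqueeze(0), w).squeeze(0)

print(manual.shape)
print(torch.allclose(manual, reference, atol=1e-5))
```

Both results should agree, since `F.conv2d` performs the same slide, multiply, and sum, only in an optimized form.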


## MNIST dataset
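- image size = `(28, 28, 1)`; the network defined below is sized for `32x32` inputs (`fc1` expects `16 * 5 * 5` features), so an MNIST image would need to be padded or resized to `32x32` before being passed in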


![nn image](images/mnist.png)

In [11]:
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 6, 5)
        self.conv2 = nn.Conv2d(6, 16, 5)

        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)
    
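    def forward(self, input):
        # Shapes in the comments assume an input of size (N, 1, 32, 32)
        # c1: convolution (1 -> 6 channels, 5x5 kernel) + ReLU: (N, 6, 28, 28)
        c1 = F.relu(self.conv1(input))
        # s2: 2x2 max pooling: (N, 6, 14, 14)
        s2 = F.max_pool2d(c1, (2, 2))
        # c3: convolution (6 -> 16 channels, 5x5 kernel) + ReLU: (N, 16, 10, 10)
        c3 = F.relu(self.conv2(s2))
        # s4: 2x2 max pooling: (N, 16, 5, 5)
        s4 = F.max_pool2d(c3, 2)
        # flatten everything except the batch dimension: (N, 400)
        s4 = torch.flatten(s4, 1)
        # f5, f6: fully connected layers with ReLU
        f5 = F.relu(self.fc1(s4))
        f6 = F.relu(self.fc2(f5))
        # output layer: 10 raw scores per sample, no activation
        output = self.fc3(f6)
        return output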

net = Net()
print(net)  
    


Net(
  (conv1): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))
  (conv2): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
  (fc1): Linear(in_features=400, out_features=120, bias=True)
  (fc2): Linear(in_features=120, out_features=84, bias=True)
  (fc3): Linear(in_features=84, out_features=10, bias=True)
)


## Explanation of each layer
- **c1** 
    - 1 input image channel
    - 6 output channels
    - 5x5 square convolution => kernel size = 5
    - ReLU activation
    - output = Tensor of size (N, 6, 28, 28), [N = batch size]

- **s2**
    - maxpooling using 2x2 grid
    - output size = (N, 6, 14, 14)

- **c3**
    - 6 input channels
    - 16 output channels
    - 5x5 square convolution => kernel size = 5
    - ReLU activation
    - output = Tensor of size (N, 16, 10, 10)

- **s4**
    - maxpooling using 2x2 grid
    - output size = (N, 16, 5, 5), then flattened to (N, 400)

- **f5**
    - (N, 400) tensor input
    - output = (N, 120) tensor
    - ReLU activation

- **f6**
    - (N, 120) tensor input
    - output = (N, 84) tensor
    - ReLU activation

- **output**
    - (N, 84) tensor input
    - output = (N, 10) tensor
    - no activation
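The sizes listed above can be verified by pushing a random batch through the same sequence of layers and reading off each intermediate shape. In this sketch the layers are created standalone rather than inside `Net`, and the batch size of 2 and the `32x32` input are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Same layer configuration as Net, created standalone for inspection
conv1, conv2 = nn.Conv2d(1, 6, 5), nn.Conv2d(6, 16, 5)
fc1, fc2, fc3 = nn.Linear(16 * 5 * 5, 120), nn.Linear(120, 84), nn.Linear(84, 10)

x = torch.rand(2, 1, 32, 32)        # N = 2, one channel, 32x32

c1 = F.relu(conv1(x))               # 32 - 5 + 1 = 28
s2 = F.max_pool2d(c1, 2)            # 28 / 2 = 14
c3 = F.relu(conv2(s2))              # 14 - 5 + 1 = 10
s4 = F.max_pool2d(c3, 2)            # 10 / 2 = 5
flat = torch.flatten(s4, 1)         # 16 * 5 * 5 = 400
f5 = F.relu(fc1(flat))
f6 = F.relu(fc2(f5))
output = fc3(f6)

stages = [("c1", c1), ("s2", s2), ("c3", c3), ("s4", s4),
          ("flat", flat), ("f5", f5), ("f6", f6), ("output", output)]
for name, t in stages:
    print(f"{name:>6}: {tuple(t.shape)}")
```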

In [14]:
params = list(net.parameters())
print(len(params))
print(params[0].size())

10
torch.Size([6, 1, 5, 5])
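The count of 10 comes from five layers with a weight and a bias each. A short sketch using `named_parameters()` makes this visible; the `nn.ModuleDict` wrapper is only an assumption used here to rebuild the same layers without repeating the `Net` class.

```python
import torch.nn as nn

# Same layers as Net, grouped in a ModuleDict so they can be listed by name
layers = nn.ModuleDict({
    "conv1": nn.Conv2d(1, 6, 5),
    "conv2": nn.Conv2d(6, 16, 5),
    "fc1": nn.Linear(16 * 5 * 5, 120),
    "fc2": nn.Linear(120, 84),
    "fc3": nn.Linear(84, 10),
})

named = list(layers.named_parameters())
for name, p in named:
    print(f"{name:<14} {str(tuple(p.shape)):<18} {p.numel()}")

total = sum(p.numel() for _, p in named)
print(len(named), total)   # 10 tensors, 61706 learnable values
```

`params[0]` in the cell above is therefore `conv1.weight`: 6 filters, each with 1 input channel and a 5x5 kernel.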


In [15]:
# random input

input = torch.rand(1, 1, 32, 32)
out = net(input)
print(out)

tensor([[ 0.1089, -0.0595, -0.0289, -0.0165, -0.0364,  0.0878,  0.0088,  0.0986,
         -0.0687, -0.0658]], grad_fn=<AddmmBackward0>)


In [16]:
net.zero_grad()
out.backward(torch.randn(1,10))

When `loss.backward()` is called, the whole graph is differentiated w.r.t. the neural net parameters, and every Tensor in the graph that has `requires_grad=True` has the gradient accumulated into its `.grad` Tensor.
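Because gradients are accumulated rather than overwritten, calling `backward()` twice without zeroing adds the two gradients together. A minimal sketch on a standalone tensor (the tensor `w` and the squared-sum loss are assumptions for illustration):

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

loss = (w ** 2).sum()      # d(loss)/dw = 2w = [2, 4]
loss.backward()
first = w.grad.clone()

loss = (w ** 2).sum()      # rebuild the graph and differentiate again
loss.backward()
second = w.grad.clone()    # accumulated: [2, 4] + [2, 4] = [4, 8]

print(first)               # tensor([2., 4.])
print(second)              # tensor([4., 8.])

w.grad.zero_()             # reset before the next backward pass
```

This is why the gradient buffers are cleared with `net.zero_grad()` before each backward pass.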

In [17]:
output = net(input)
target = torch.randn(10)
target = target.view(1, -1)
criterion = nn.MSELoss()

loss = criterion(output, target)
print(loss)

tensor(0.2746, grad_fn=<MseLossBackward0>)
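With its default settings, `nn.MSELoss` returns the mean of the squared differences between output and target. The sketch below checks this against a hand-written version, using random tensors as stand-ins for the network output and target.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
output = torch.randn(1, 10)    # stand-in for the network output
target = torch.randn(1, 10)    # stand-in target of the same shape

builtin = nn.MSELoss()(output, target)
manual = ((output - target) ** 2).mean()   # mean of squared differences

print(torch.allclose(builtin, manual))
```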


input -> conv2d -> relu -> maxpool2d -> conv2d -> relu -> maxpool2d
      -> flatten -> linear -> relu -> linear -> relu -> linear
      -> MSELoss
      -> loss

In [18]:
print(loss.grad_fn)
print(loss.grad_fn.next_functions[0][0])
print(loss.grad_fn.next_functions[0][0].next_functions[0][0]) 

<MseLossBackward0 object at 0x00000216D487DCC0>
<AddmmBackward0 object at 0x00000216D49374C0>
<AccumulateGrad object at 0x00000216D49374C0>
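Instead of chaining `next_functions[0][0]` by hand, the graph can be walked recursively. This is a small sketch on a toy expression; `walk` is a hypothetical helper, not part of PyTorch.

```python
import torch

def walk(fn, depth=0, names=None):
    """Print the backward graph starting at fn and return the visited node names."""
    names = [] if names is None else names
    if fn is None:                  # inputs that need no gradient appear as None
        return names
    name = type(fn).__name__
    names.append(name)
    print("  " * depth + name)
    for next_fn, _ in fn.next_functions:
        walk(next_fn, depth + 1, names)
    return names

x = torch.rand(3, requires_grad=True)
loss = (x * 2).sum()

names = walk(loss.grad_fn)
```

Leaf tensors that require gradients show up as `AccumulateGrad` nodes, which is where the computed gradient is added to `.grad`.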


In [19]:
net.zero_grad()     # zeroes the gradient buffers of all parameters

print('conv1.bias.grad before backward')
print(net.conv1.bias.grad)

loss.backward()

print('conv1.bias.grad after backward')
print(net.conv1.bias.grad)

conv1.bias.grad before backward
None
conv1.bias.grad after backward
tensor([-0.0001,  0.0101, -0.0135,  0.0102, -0.0006,  0.0000])


Update the weights with the simple update rule: `weight = weight - learning_rate * gradient`

In [20]:
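learning_rate = 0.01
with torch.no_grad():    # keep the update itself out of the autograd graph
    for p in net.parameters():
        p.sub_(p.grad * learning_rate)    # in-place: weight = weight - learning_rate * gradient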
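The same update is available through `torch.optim.SGD`, which applies this rule when used without momentum or weight decay. The sketch below uses a small `nn.Linear` model and random data as stand-ins (assumptions made to keep the example self-contained) and checks that one optimizer step matches the manual rule.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
model = nn.Linear(4, 2)                      # stand-in for net
x, target = torch.rand(3, 4), torch.rand(3, 2)
criterion = nn.MSELoss()

optimizer = optim.SGD(model.parameters(), lr=0.01)

before = model.weight.detach().clone()

optimizer.zero_grad()                        # zero the gradient buffers
loss = criterion(model(x), target)
loss.backward()
expected = before - 0.01 * model.weight.grad # weight - learning_rate * gradient
optimizer.step()                             # performs the update

print(torch.allclose(model.weight, expected))
```

Using an optimizer object also makes it easy to switch to other update rules from `torch.optim` without changing the rest of the loop.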