<a href="https://colab.research.google.com/github/KenJiangg/Learning-Deep-Learning/blob/master/PyTorchNNTutorial_W_Notes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Feed Forward Neural Networks from the PyTorch Tutorial
  This is my take on the PyTorch tutorial, in which I try to explain the things I needed to learn as a complete beginner. 

# Useful background knowledge about Neural Networks
<a href="https://en.wikipedia.org/wiki/Rectifier_(neural_networks)"> ReLu Definition </a> <br>
<a href="https://en.wikipedia.org/wiki/Kernel_(image_processing)"> Kernel Definition </a> <br>
<a href = "https://pytorch.org/tutorials/beginner/blitz/neural_networks_tutorial.html#sphx-glr-beginner-blitz-neural-networks-tutorial-py"> PyTorch Tutorial </a>  
<a href="http://cs231n.stanford.edu/handouts/derivatives.pdf"> Backpropagation </a>
<br><br>
**Brief Explanations about the concepts needed** <br>
*ReLU*:<br>
ReLU stands for rectified linear unit, a type of activation function defined as $ f(x) = \max(0, x) $. An activation function is applied to a layer's output so that the network is not just a stack of linear operations. Some activation functions squash their input into a fixed range (sigmoid maps to values from 0 to 1 and tanh to values from -1 to 1), but ReLU does not: it outputs 0 for negative inputs and passes positive inputs through unchanged. ReLU introduces non-linearity into neural networks because it is piecewise linear rather than linear, and its derivative is $ f'(x) = \begin{cases} 0, & \text{if } x < 0 \\ 1, & \text{if } x > 0 \end{cases} $.
ReLU is a popular activation function because it speeds up training: its gradient is cheap to compute, being either 0 or 1 depending on whether x is negative or positive.
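
A minimal sketch of ReLU and its derivative in plain Python, not part of the tutorial. The helper names `relu` and `relu_grad` are made up for illustration, and the derivative at exactly 0 is undefined, so returning 0 there is just a convention assumed here.

```python
def relu(x: float) -> float:
    """Rectified linear unit: max(0, x)."""
    return max(0.0, x)


def relu_grad(x: float) -> float:
    """Derivative of ReLU; the value at x == 0 is a convention (0 here)."""
    return 1.0 if x > 0 else 0.0


inputs = [-2.0, -0.5, 0.0, 0.5, 2.0]
outputs = [relu(x) for x in inputs]
grads = [relu_grad(x) for x in inputs]

print(outputs)  # [0.0, 0.0, 0.0, 0.5, 2.0]
print(grads)    # [0.0, 0.0, 0.0, 1.0, 1.0]
```
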
<br><br>
*Backpropagation*:<br>
In a neural network, a single layer is typically a function of weights (*w*) and inputs (*x*). Passing an input through the network produces an output vector (*y*), and applying a loss function to that vector gives a scalar loss (*L*). From this, assume we can compute $ \frac{\partial L}{\partial y} $. The values we want to obtain are $ \frac{\partial L}{\partial w} $ and $ \frac{\partial L}{\partial x} $. One way of getting them is to compute the Jacobians $ \frac{\partial y}{\partial w} $ and $ \frac{\partial y}{\partial x} $ and apply the chain rule using matrix multiplication. However, for typical neural networks this is impractical because of how much memory is needed to store the Jacobian matrices. Instead, we can avoid forming the Jacobians explicitly by working through small cases to find a direct expression for the result. <a href="http://cs231n.stanford.edu/handouts/derivatives.pdf"> Math behind backpropagation and why we can use small cases </a>
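
To make this concrete, here is a small sketch that is not from the tutorial. It assumes a single bias-free linear layer `y = x @ W.T` and a made-up loss `L = sum(y ** 2)`. For that layer the gradients reduce to two small matrix products, and the code checks them against what autograd computes.

```python
import torch

torch.manual_seed(0)

# One linear layer without bias: y = x @ W.T
x = torch.randn(4, 3, requires_grad=True)  # batch of 4 inputs with 3 features
W = torch.randn(2, 3, requires_grad=True)  # weights mapping 3 features to 2
y = x @ W.T                                # outputs, shape (4, 2)
L = (y ** 2).sum()                         # scalar loss (illustrative choice)

# Let autograd backpropagate from the scalar loss
L.backward()

# Upstream gradient dL/dy; for this particular loss it equals 2 * y
dL_dy = (2 * y).detach()

# Direct expressions that never build the full Jacobians dy/dx or dy/dW
dL_dx = dL_dy @ W.detach()    # shape (4, 3), same as x
dL_dW = dL_dy.T @ x.detach()  # shape (2, 3), same as W

x_matches = torch.allclose(x.grad, dL_dx, atol=1e-5)
W_matches = torch.allclose(W.grad, dL_dW, atol=1e-5)
print(x_matches, W_matches)
```

Each gradient has the same shape as the tensor it belongs to, which is far smaller than the Jacobian it replaces.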

# Learning about Neural Networks 

  <img src="https://pytorch.org/tutorials/_images/mnist.png"> 
  I had trouble understanding this picture and neural networks at first, but what helped me understand the design of a feed-forward neural network was matching the diagram to the code. <br><br>
  INPUT -> C1(self.conv1 = nn.Conv2d(1,6,5)): <br>
   The parameters for Conv2d in the tutorial are, respectively, in_channels (1 in this case), out_channels (6 in this case) and kernel_size (5 in this case, which means a 5x5 kernel). See <a href="https://github.com/pytorch/pytorch/blob/master/torch/nn/modules/conv.py"> this link</a> for further information about the parameters of the convolutional layer. <br>
  C1 -> S2 (x= F.max_pool2d(F.relu(self.conv1(x)), (2, 2))): <br>
  From what I understand, ReLU (rectified linear unit) is the activation function applied to the convolution output, and max_pool2d (the pooling layer) is the subsampling step: it keeps the largest value in each 2x2 window, which halves the width and height of the feature map and makes the later layers cheaper to compute.  <br> 
  S2 -> C3(self.conv2 = nn.Conv2d(6,16,5)): <br>
  Same as INPUT->C1<br>
 C3 ->S4(x = F.max_pool2d(F.relu(self.conv2(x)), 2)): <br>
  Same as C2->S2<br>
  S4 -> C5(x = x.view(-1, self.num_flat_features(x)) & self.fc1 = nn.Linear(16 * 5 * 5, 120) & x = F.relu(self.fc1(x))): <br>
  Here is where the feature maps are flattened into one long vector. First, x = x.view(-1, self.num_flat_features(x)) flattens the tensor, where num_flat_features is a helper method defined in the class. Next, ReLU is applied to fc1(x), where fc1 is a fully connected layer that takes an input of size 16 * 5 * 5 = 400 (step S2 -> C3 outputs 16 channels, and step C3 -> S4 subsamples each channel to 5 x 5) and outputs a vector of size 120. <br>
  C5 -> F6(self.fc2 = nn.Linear(120, 84) & x = F.relu(self.fc2(x))):<br>
  This fully connected layer takes an input of size 120 and outputs a vector of size 84; ReLU is once again applied to the output as the non-linearity. <br>
  F6 -> OUTPUT(self.fc3 = nn.Linear(84, 10) & x = self.fc3(x)): <br>
    This is the final step, where the input has a size of 84 and the output has a size of 10. <br><br>
  Running an image through these steps transforms it into a vector of size 10. Later in the tutorial, we learn how to adjust the network's weights to reduce the loss. 
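
The sizes in the walkthrough can be checked with a little arithmetic. The sketch below is plain Python, not tutorial code. The helpers `conv2d_out` and `max_pool2d_out` are made up for illustration and assume the defaults used in this network: stride 1 and no padding for the convolutions, and a pooling stride equal to the window.

```python
def conv2d_out(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Output width/height of a convolution (dilation assumed to be 1)."""
    return (size + 2 * padding - kernel) // stride + 1


def max_pool2d_out(size: int, window: int) -> int:
    """Output width/height of max pooling whose stride equals its window."""
    return (size - window) // window + 1


size = 32                          # INPUT: 1 x 32 x 32
c1 = conv2d_out(size, kernel=5)    # C1: 6 x 28 x 28
s2 = max_pool2d_out(c1, window=2)  # S2: 6 x 14 x 14
c3 = conv2d_out(s2, kernel=5)      # C3: 16 x 10 x 10
s4 = max_pool2d_out(c3, window=2)  # S4: 16 x 5 x 5
flat = 16 * s4 * s4                # flattened input size for fc1

print(c1, s2, c3, s4, flat)  # 28 14 10 5 400
```

This is where the 16 * 5 * 5 in nn.Linear(16 * 5 * 5, 120) comes from, and why the network expects a 32x32 input.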

In [2]:
import torch
import torch.nn as nn
import torch.nn.functional as F


class Net(nn.Module):

    def __init__(self):
        super(Net, self).__init__()
        # 1 input image channel, 6 output channels, 5x5 square convolution
        # kernel
        self.conv1 = nn.Conv2d(1, 6, 5)
        self.conv2 = nn.Conv2d(6, 16, 5)
        # an affine operation: y = Wx + b
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        # Max pooling over a (2, 2) window
        x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
        # If the window is square, you can specify just a single number
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = x.view(-1, self.num_flat_features(x))
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

    def num_flat_features(self, x):
        size = x.size()[1:]  # all dimensions except the batch dimension
        num_features = 1
        for s in size:
            num_features *= s
        return num_features


net = Net()
print(net)

Net(
  (conv1): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))
  (conv2): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
  (fc1): Linear(in_features=400, out_features=120, bias=True)
  (fc2): Linear(in_features=120, out_features=84, bias=True)
  (fc3): Linear(in_features=84, out_features=10, bias=True)
)


The neural net defined above has 10 learnable parameter tensors: a weight and a bias for each of its 5 layers.


In [3]:
params = list(net.parameters())
print(len(params))
print(params[0].size())  # conv1's .weight

10
torch.Size([6, 1, 5, 5])
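
To see where the 10 comes from, the sketch below lists every parameter tensor by name. It is not tutorial code: so that it runs on its own, it rebuilds the same five layers in an `nn.ModuleDict` (a stand-in for the `Net` class above, chosen here only for illustration) and counts the individual numbers inside each tensor.

```python
import torch.nn as nn

# Stand-in for Net: the same five layers, without the forward pass
layers = nn.ModuleDict({
    "conv1": nn.Conv2d(1, 6, 5),
    "conv2": nn.Conv2d(6, 16, 5),
    "fc1": nn.Linear(16 * 5 * 5, 120),
    "fc2": nn.Linear(120, 84),
    "fc3": nn.Linear(84, 10),
})

# Each layer contributes one weight tensor and one bias tensor
for name, p in layers.named_parameters():
    print(name, tuple(p.shape), p.numel())

num_tensors = len(list(layers.parameters()))
num_values = sum(p.numel() for p in layers.parameters())
print(num_tensors, num_values)  # 10 61706
```

So `len(params)` counts tensors, not individual weights; the network holds 61,706 learnable numbers in total.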


When we pass a randomly generated 32 by 32 input (with a batch dimension and a channel dimension of 1 each) into the neural network, it outputs a vector of size 10.


In [4]:
input = torch.randn(1, 1, 32, 32)
out = net(input)
print(out)

tensor([[ 0.0071,  0.0359, -0.0627, -0.0781, -0.0391, -0.1182, -0.0906, -0.1089,
         -0.0271,  0.0129]], grad_fn=<ThAddmmBackward>)


In [0]:
net.zero_grad() # zeroes the gradient buffer
out.backward(torch.randn(1, 10)) # backpropagates with random gradients generated by torch.randn

A loss function is a metric for how far the output is from the target. Here, the output is net(input), where input is a random 32x32 matrix, and the target is torch.randn(10), reshaped to match the output. In a real task the target would hold the actual values; in this case they are randomly generated. The code below uses Mean Squared Error (nn.MSELoss) as the loss function to compare the output to the target.



In [6]:
output = net(input)
target = torch.randn(10)  # a dummy target, for example
target = target.view(1, -1)  # make it the same shape as output
criterion = nn.MSELoss()

loss = criterion(output, target)
print(loss)

tensor(0.7968, grad_fn=<MseLossBackward>)
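
As a quick check of what Mean Squared Error does, the sketch below compares `nn.MSELoss` with the formula written out by hand. It is not tutorial code, and it uses two random tensors as stand-ins for the network output and the target.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-ins for the network output and the dummy target
output = torch.randn(1, 10)
target = torch.randn(1, 10)

# Built-in loss (the default reduction averages over all elements)
loss_builtin = nn.MSELoss()(output, target)

# The same thing by hand: mean of the squared differences
loss_manual = ((output - target) ** 2).mean()

losses_match = torch.allclose(loss_builtin, loss_manual)
print(losses_match)
```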


In [7]:
print(loss.grad_fn)  # MSELoss
print(loss.grad_fn.next_functions[0][0])  # Linear
print(loss.grad_fn.next_functions[0][0].next_functions[0][0])  # ReLU in the tutorial; the node printed depends on the PyTorch version

<MseLossBackward object at 0x7f2613beca90>
<ThAddmmBackward object at 0x7f2613becc88>
<ExpandBackward object at 0x7f2613beca90>


After computing the loss above, we can apply **backpropagation** to it to obtain the gradients used to adjust the weights of our neural network.



In [8]:
net.zero_grad()     # zeroes the gradient buffers of all parameters

print('conv1.bias.grad before backward')
print(net.conv1.bias.grad)

loss.backward()

print('conv1.bias.grad after backward')
print(net.conv1.bias.grad)

conv1.bias.grad before backward
tensor([0., 0., 0., 0., 0., 0.])
conv1.bias.grad after backward
tensor([-0.0061,  0.0188, -0.0018, -0.0167, -0.0186,  0.0097])
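
The reason for zeroing the gradient buffers first is that `backward()` adds to whatever gradient is already stored instead of replacing it. A minimal sketch of that behaviour, using a made-up three-element tensor `w` rather than the network:

```python
import torch

w = torch.ones(3, requires_grad=True)

# First backward pass: d/dw of sum(2 * w) is 2 for every element
(w * 2).sum().backward()
first = w.grad.clone()   # tensor([2., 2., 2.])

# Second backward pass without zeroing: the new gradient is added on top
(w * 2).sum().backward()
second = w.grad.clone()  # tensor([4., 4., 4.])

# Zero the buffer, then run backward again: back to the single-pass value
w.grad.zero_()
(w * 2).sum().backward()
third = w.grad.clone()   # tensor([2., 2., 2.])

print(first, second, third)
```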


In [0]:
learning_rate = 0.01
for f in net.parameters():
    f.data.sub_(f.grad.data * learning_rate)

Update rules are used to update the weights of the neural network. The loop above applies the simplest one: weight = weight - learning_rate * gradient.

In [0]:
import torch.optim as optim

# create your optimizer
optimizer = optim.SGD(net.parameters(), lr=0.01)

# in your training loop:
optimizer.zero_grad()   # zero the gradient buffers
output = net(input)
loss = criterion(output, target)
loss.backward()
optimizer.step()    # Does the update

For neural networks, you will usually want to use a ready-made update rule; in this case we use Stochastic Gradient Descent (SGD in the code). The optimizer updates the weights of our neural network for us.
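
To convince yourself that `optim.SGD` with its default settings (no momentum, no weight decay) does the same thing as the manual update loop, the sketch below runs one step of each on two identical copies of a small model. It is not tutorial code: the `nn.Linear(4, 2)` model and the random data are assumptions made to keep the example short.

```python
import copy

import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Two identical copies of a small model (a stand-in for Net)
model_a = nn.Linear(4, 2)
model_b = copy.deepcopy(model_a)

x = torch.randn(8, 4)
target = torch.randn(8, 2)
criterion = nn.MSELoss()
lr = 0.01

# Manual update: weight = weight - learning_rate * gradient
model_a.zero_grad()
criterion(model_a(x), target).backward()
with torch.no_grad():  # do not record the update itself in the graph
    for p in model_a.parameters():
        p -= lr * p.grad

# The same step done by the optimizer
optimizer = optim.SGD(model_b.parameters(), lr=lr)
optimizer.zero_grad()
criterion(model_b(x), target).backward()
optimizer.step()

same = all(
    torch.allclose(pa, pb)
    for pa, pb in zip(model_a.parameters(), model_b.parameters())
)
print(same)
```

The manual loop here uses `torch.no_grad()` instead of `.data`; both avoid tracking the update in autograd.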

In [12]:
print(net.conv1.bias.grad) #after using the stochastic gradient descent update rule

tensor([-0.0049,  0.0070,  0.0058, -0.0181, -0.0141, -0.0063])
