In [1]:
import torch, torchvision
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

### Multi-Layered Perceptrons
MLPs are nearly synonymous with neural networks, so it makes sense to start discussing CNNs by discussing MLPs. Overall, an MLP is just a composition of perceptrons (first formulated by Frank Rosenblatt in the late 1950s) wired together in layers, where each perceptron receives inputs, holds a set of initialized weights, and produces an output.

The key to these tools is finding the optimal weights. That is what training a neural network actually does: it finds the weights for each of the inputs. To calculate the change in weights we use a popular algorithm called backpropagation. To understand in depth what is happening, you need a bit of calculus in your background; otherwise just trust me when I say it works and skip the next paragraph.

The output of a perceptron is the weighted sum of its inputs fed through an activation function. At the end of the network a loss is calculated. To get the necessary change in weights we simply go backwards: we take the derivatives of each of those steps and chain them together using the chain rule:

$$\delta_j(k) = -\frac{\partial E(k)}{\partial e_j(k)}\frac{\partial e_j(k)}{\partial y_j(k)}\frac{\partial y_j(k)}{\partial v_j(k)}$$

$$\delta_j(k) = e_j(k)\Phi_j'(v_j(k))$$

Where $e_j(k)$ is the difference between the label and the output, and $\Phi_j'(v_j(k))$ is the derivative of the activation function evaluated at $v_j(k)$, the weighted sum of the inputs. Finally, we get the necessary change in weights by applying the gradient descent rule at each perceptron:

$$\Delta w_{ij}(k) = \alpha \delta_j(k)y_i(k)$$

with $y_i(k)$ being the input arriving from perceptron $i$ and $\alpha$ being the learning rate. This is sketched in the `train_mlp` function below, which also adds a momentum term (`beta`) to each update. Please keep in mind this is for a generic fully connected neural network, not a CNN (yet!).
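To make the update rule concrete, here is a minimal sketch of one update for a single sigmoid neuron. It assumes the squared error $E = \frac{1}{2}e^2$ with $e$ taken as label minus output; the names `sigmoid`, `delta_rule_step`, and `squared_error`, and all the numbers, are illustrative and not part of the original notebook.

```python
import numpy as np

def sigmoid(v):
    """Logistic activation function."""
    return 1.0 / (1.0 + np.exp(-v))

def delta_rule_step(w, x, label, alpha):
    """One gradient descent update for a single sigmoid neuron."""
    v = np.dot(w, x)             # weighted sum of the inputs
    y = sigmoid(v)               # neuron output
    e = label - y                # error term e_j(k)
    delta = e * y * (1.0 - y)    # e_j(k) * Phi'(v_j(k)); sigmoid' = y * (1 - y)
    return w + alpha * delta * x # w + Delta w

def squared_error(w, x, label):
    return 0.5 * (label - sigmoid(np.dot(w, x))) ** 2

# Illustrative values only
w = np.array([0.1, -0.2, 0.05])
x = np.array([1.0, 0.5, -1.5])
label = 1.0

err_before = squared_error(w, x, label)
w_new = delta_rule_step(w, x, label, alpha=0.5)
err_after = squared_error(w_new, x, label)
print(err_before, err_after)  # the error should shrink after the step
```

A single step with a reasonably small learning rate should lower the error on that sample; repeating it over many samples and layers is what the training loop does.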

In [2]:
def train_mlp(activation, e, labels, alpha, beta):
    previous_h1 = 0
    previous_in = 0
    epoch = e
    err = np.zeros((epoch,10))
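    # train_image_list is read from the notebook's global scope (not defined in this cell)
    data = np.asarray(train_image_list)
    # initialize weights uniformly at random in [0, 1): 196 inputs -> 100 hidden -> 10 outputs
    input_weights = np.random.rand(196, 100)
    h1_weights = np.random.rand(100, 10)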
    for k in range(epoch):
        err[k] = 0
        for i in range(15000):
            
            # sample and preprocess 
            d = data[i,:]
            d = np.divide(d,255)
            d = np.reshape(d, (1,-1))

            # Forward Pass Into Hidden Layer 1
            fp1 = activation(np.dot(d, input_weights))

            # Forward Pass Into Output Layer (linear unless the sigmoid line is enabled)
            # fpo = activation(np.dot(fp1, h1_weights)) #enable for sigmoid
            fpo = np.dot(fp1, h1_weights) 

            # ERROR RATE CALCULATION
            # Reshape labels to fit error calculation
            l = np.reshape(labels[i,:], (1,-1))
            err[k] = err[k] + (1.0/2.0) * (np.power((fpo - l), 2.0))

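            # BACKPROP
            # Gradient of the squared error w.r.t. the (linear) output: fpo - l
            deltas_out_layer = (-1) * (l - fpo)
            # Outer product via broadcasting: (100, 1) * (1, 10) -> (100, 10)
            partial_out = np.transpose(fp1) * deltas_out_layer

            # derive=True is given the activated output fp1, so `activation` is expected
            # to compute its derivative from the output value, not the pre-activation
            deltas_h1_layer =  (activation(fp1, derive=True)) * np.dot(deltas_out_layer, np.transpose(h1_weights))
            partial_h1 = np.transpose(d) * deltas_h1_layer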
            
            momentum_h1 = (beta * previous_h1) + ((-alpha) * partial_out) 
            h1_weights = h1_weights + momentum_h1
            previous_h1 = momentum_h1

            momentum_in = (beta * previous_in) + ((-alpha) * partial_h1)
            input_weights = input_weights + momentum_in
            previous_in = momentum_in

    plt.plot(err)  # one curve per output unit, across all epochs
    plt.ylabel('error')
    plt.xlabel('epochs')
    plt.show()
    return input_weights, h1_weights
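
`train_mlp` expects an `activation` callable that accepts a `derive` keyword, but no such function is defined in the cells shown here. A minimal sketch of a compatible sigmoid follows. The name `sigmoid` is an assumption, as is the convention that `derive=True` receives the already-activated output (which is how the `activation(fp1, derive=True)` call above uses it).

```python
import numpy as np

def sigmoid(x, derive=False):
    """Sigmoid activation matching the activation(x, derive=...) call pattern."""
    if derive:
        # Assumption: x is already a sigmoid output, so the derivative is x * (1 - x)
        return x * (1.0 - x)
    return 1.0 / (1.0 + np.exp(-x))

hidden = sigmoid(np.array([[-2.0, 0.0, 2.0]]))  # forward pass values
grad = sigmoid(hidden, derive=True)             # derivative from the outputs
```

It could then be passed as `train_mlp(sigmoid, e, labels, alpha, beta)`, provided `train_image_list` and `labels` are already defined.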

### ConvNets/CNNs and Convolutions
A CNN is very similar to a standard fully connected neural network, except that instead of connecting every output of the previous layer to every perceptron in the next layer with its own weight, it reuses (shares) a small set of weights across the input. This is done through convolutions, a concept borrowed from image processing and applied to neural networks.

In essence, a convolution slides a second, smaller matrix (usually called a kernel, window, or mask) across an image to gather information about it: at each position the kernel values are multiplied element-wise with the pixels underneath and summed. The distance the kernel moves from one calculation to the next is called the **stride**.
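
The sliding-window idea and the effect of the stride can be sketched in plain NumPy. The function name `conv2d`, the averaging kernel, and the 5x5 test image are illustrative choices. Strictly speaking this computes a cross-correlation (the kernel is not flipped), which is what deep learning libraries usually mean by "convolution".

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Slide `kernel` over `image` (no padding) and return the feature map."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            top, left = r * stride, c * stride
            window = image[top:top + kh, left:left + kw]
            # element-wise product of the window and the kernel, then summed
            out[r, c] = np.sum(window * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.ones((3, 3)) / 9.0  # simple averaging (box blur) kernel

out_s1 = conv2d(image, kernel, stride=1)  # kernel moves 1 pixel at a time
out_s2 = conv2d(image, kernel, stride=2)  # kernel moves 2 pixels at a time
print(out_s1.shape, out_s2.shape)
```

A larger stride visits fewer positions, so the output feature map gets smaller.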

The entire point of a convolutional neural network is to find the correct values that belong in the kernel matrices; that is what is being optimized in a CNN. These kernel matrices are updated during backpropagation. In doing so, they become pseudo-feature detectors and get really good at locating edges, color differentials, etc. For more examples of what standardized kernels can do and look like, see [this wiki page](https://en.wikipedia.org/wiki/Kernel_(image_processing)).
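
To see that the kernel really is just a set of trainable weights, here is a small PyTorch sketch with a single 3x3 kernel and one optimization step. The random image, the all-zero target, and the learning rate are arbitrary assumptions chosen only to produce a loss; this is not a meaningful training setup.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# One 3x3 kernel on a single-channel input; its 9 values are the trainable weights
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, bias=False)
optimizer = torch.optim.SGD(conv.parameters(), lr=0.1)

image = torch.rand(1, 1, 8, 8)    # (batch, channels, height, width)
target = torch.zeros(1, 1, 6, 6)  # arbitrary target, just to get a loss

kernel_before = conv.weight.detach().clone()

loss = nn.functional.mse_loss(conv(image), target)
optimizer.zero_grad()
loss.backward()   # backpropagation fills conv.weight.grad
optimizer.step()  # gradient descent changes the kernel values

kernel_after = conv.weight.detach().clone()
print(kernel_before.squeeze())
print(kernel_after.squeeze())
```

The two printed matrices differ: the optimizer has nudged the kernel in the direction that lowers the loss.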

Really quickly, just so we can get going, let's talk about one more term: **transfer learning**. This is basically the idea of taking a pretrained neural network (trained on other datasets etc.) and using its learned weights and features as the starting point for your own training.
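
One common form of transfer learning freezes the pretrained layers and trains only a new final layer. The sketch below shows that mechanism under a loud assumption: `backbone` is a tiny, randomly initialized stand-in so the example runs offline. In practice it would be a genuinely pretrained network, for example one from `torchvision.models`. The helper name `freeze_backbone` and the 2-class head are hypothetical.

```python
import torch
import torch.nn as nn

def freeze_backbone(model):
    """Stop gradients from updating any parameter in `model`."""
    for param in model.parameters():
        param.requires_grad = False

# Stand-in for a pretrained feature extractor (assumption: NOT actually pretrained)
backbone = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)
freeze_backbone(backbone)

head = nn.Linear(8, 2)  # new, trainable layer for a hypothetical 2-class task
model = nn.Sequential(backbone, head)

# Only the parameters that still require gradients go to the optimizer
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.01)

logits = model(torch.rand(4, 3, 32, 32))  # batch of 4 RGB images
print(logits.shape, len(trainable))
```

The alternative is to leave everything trainable and fine-tune the whole network, using the pretrained weights only as the initialization.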

Great, now let's go into [PyTorch](https://pytorch.org/)! To start, we shall go through a [quick demo of transfer learning](https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html) provided by PyTorch.

Convolution Visualization: https://gfycat.com/plasticmenacingdegu