# Accelerate your AI journey with Pre-Trained Models 


### Setup

To run this notebook, you will need the following:
- Python 3.5/3.6 (via Anaconda)
- TensorFlow > 1.2
- Keras
- cython
- opencv
- easydict

Image processing tasks will be much faster with a GPU and a GPU-enabled build of TensorFlow.

To install tensorflow_gpu, follow the instructions at: [tf_GPU link](link.com)

If you do not have access to a local GPU, you can get access to a GPU-enabled VM through [AWS](aws.com), [Azure](portal.azure.com), or services such as [Paperspace](paperspace.com).

[Google Colab](google.com) will also give you access to a K80 accelerator.


### Acknowledgements / Thefts

Most of the code below is not my own. Cats & Dogs dataset stolen from [fast.ai](fast.ai), Faster R-CNN implementation stolen from [smallcorgi](smallcorgi.githum).


## Neural Networks in 5(ish) minutes

For more detailed explanations, I recommend Brandon Rohrer's materials on [artificial neural networks](Link.com) and [convolutional neural networks](link.com).


#### Weights, Activations, Backpropagation

For starters: matrix multiplication, dot products, and activation functions. 

In a _linear regression_ with two inputs ($x_1$, $x_2$), we could represent the model as $y = w_1 x_1 + w_2 x_2$.

In a _logistic regression_, we would take the same inputs, multiply each of them by a weight, and transform the sum with the logistic (sigmoid) function: $y = \frac{1}{1 + e^{-(w_1 x_1 + w_2 x_2)}}$.
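
To make the two formulas concrete, here is a minimal sketch in plain Python. The input and weight values are made up for illustration and are not taken from any dataset used in this notebook.

```python
import math


def linear_regression(x1, x2, w1, w2):
    """Weighted sum of the inputs: y = w1*x1 + w2*x2."""
    return w1 * x1 + w2 * x2


def logistic_regression(x1, x2, w1, w2):
    """Same weighted sum, squashed into (0, 1) by the logistic function."""
    z = linear_regression(x1, x2, w1, w2)
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical inputs and weights, chosen so the arithmetic is easy to follow.
x1, x2 = 2.0, -1.0
w1, w2 = 0.5, 1.0

linear_output = linear_regression(x1, x2, w1, w2)      # 0.5*2 + 1*(-1) = 0.0
logistic_output = logistic_regression(x1, x2, w1, w2)  # logistic(0.0) = 0.5

print(linear_output, logistic_output)
```

The only difference between the two models is the final squashing step, which is the same role an activation function plays in a neural network.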


![Ann](./ann.png)

In a neural network, we perform a similar operation at every node. We take an input vector x (x_1...x_n), take the dot product of our input vector and a weight vector, and then transform the result with an activation function. A simple network with 3 inputs (X1, X2, X3) and 2 neurons (called perceptrons) using a sigmoid activation would have 6 weights (2 sets of 3 weights, ignoring bias terms) and calculate 2 sigmoid functions.
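
The 3-input, 2-neuron layer described above can be sketched in a few lines of plain Python. The weight values here are arbitrary placeholders (a real network would learn them), and bias terms are left out to match the 6-weight count.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense_layer(inputs, weights):
    """Compute one sigmoid activation per neuron.

    `weights` holds one weight vector per neuron; each neuron takes the
    dot product of its weights with the inputs, then applies the sigmoid.
    """
    activations = []
    for neuron_weights in weights:
        z = sum(w * x for w, x in zip(neuron_weights, inputs))
        activations.append(sigmoid(z))
    return activations


inputs = [1.0, 2.0, 3.0]       # X1, X2, X3
weights = [
    [1.0, 1.0, -1.0],          # neuron 1 (placeholder values)
    [0.5, -0.5, 0.5],          # neuron 2 (placeholder values)
]

activations = dense_layer(inputs, weights)
weight_count = sum(len(neuron_weights) for neuron_weights in weights)

print(weight_count)   # 6 weights: 2 neurons x 3 inputs
print(activations)    # one sigmoid output per neuron
```

In practice the loop over neurons is replaced by a single matrix multiplication, which is why GPUs help so much.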

![Ann2](./ann2.gif)

While smooth activations (sigmoid, tanh, softplus, etc.) are the traditional examples, piecewise-linear activations such as the Rectified Linear Unit (ReLU) have become popular due to their advantages in training.
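
One commonly cited training advantage (it is not spelled out above, so treat this as an illustration rather than the whole story) is that the sigmoid's gradient shrinks toward zero for large inputs, while ReLU's gradient stays at 1 for any positive input. A small sketch comparing the two:

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    """Derivative of the sigmoid: s(z) * (1 - s(z)). Peaks at 0.25 when z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)


def relu(z):
    return max(0.0, z)


def relu_grad(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0


for z in (-5.0, 0.0, 5.0):
    print(z, sigmoid_grad(z), relu_grad(z))
```

At `z = 5.0` the sigmoid gradient is already below 0.01, so very little error signal passes back through that neuron, whereas ReLU passes it through unchanged.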

This is the forward pass. In order to optimize and train the network, we propagate the error (the difference between our calculated value and the intended "true" value) backwards through the network. **Very** loosely: because each layer depends on the layers before it, we can use the chain rule to take the partial derivative of the error with respect to each weight, layer by layer, and then adjust each weight a small step against its gradient (_descending the gradient_).
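
As a deliberately tiny illustration, here is gradient descent on a single sigmoid neuron with a squared-error loss. This is a sketch under simplifying assumptions: one neuron, one training example, made-up numbers, and a hypothetical `train_step` helper. Full backpropagation repeats the same chain-rule step through every layer.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_step(weights, inputs, target, learning_rate):
    """One forward pass, one backward pass, one weight update."""
    # Forward pass: weighted sum, then activation.
    z = sum(w * x for w, x in zip(weights, inputs))
    prediction = sigmoid(z)
    error = 0.5 * (prediction - target) ** 2

    # Backward pass (chain rule): dE/dw_i = dE/dp * dp/dz * dz/dw_i
    d_error_d_prediction = prediction - target
    d_prediction_d_z = prediction * (1.0 - prediction)
    gradients = [d_error_d_prediction * d_prediction_d_z * x for x in inputs]

    # Descend the gradient: move each weight against its partial derivative.
    new_weights = [w - learning_rate * g for w, g in zip(weights, gradients)]
    return new_weights, error


weights = [0.0, 0.0]   # arbitrary starting point
inputs = [1.0, 2.0]    # made-up training example
target = 1.0

errors = []
for _ in range(50):
    weights, error = train_step(weights, inputs, target, learning_rate=0.5)
    errors.append(error)

print(errors[0], errors[-1])  # the error shrinks as the weights are adjusted
```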

[Andrew Ng on backprop](https://www.youtube.com/watch?v=mOmkv5SI9hU)

#### Convolutions

For computer vision and image-based tasks, we are often dealing with CNNs, or convolutional neural networks. In a CNN, there are two important layer types: convolutional layers (passing kernels over an image) and pooling layers (summarizing regions of the previous layer, usually by taking the maximum, to downsample).

Convolution:
![CNN](./conv.png)

Convolution Process:
![CNN](./conv.gif)

Pooling:
![Pool](./pool.png)

In practice, we use banks of filters: at each block of convolutional layers, there may be hundreds of 3x3 filters. In a traditional MLP network, we are training the weights. In a convolutional network, the weights we are training are the values inside the kernels.
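
The convolution and pooling operations described above can be sketched in plain Python on a toy image. The 6x6 image and the hand-written vertical-edge kernel are assumptions for illustration; in a real CNN the kernel values are learned, and libraries like Keras do this far more efficiently.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (stride 1, no padding).

    Like most deep learning libraries, this does not flip the kernel,
    so strictly speaking it is a cross-correlation.
    """
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    output = []
    for row in range(out_h):
        output_row = []
        for col in range(out_w):
            total = 0.0
            for i in range(k_h):
                for j in range(k_w):
                    total += image[row + i][col + j] * kernel[i][j]
            output_row.append(total)
        output.append(output_row)
    return output


def max_pool2d(feature_map, size=2):
    """Keep the maximum of each non-overlapping size x size window."""
    pooled = []
    for row in range(0, len(feature_map) - size + 1, size):
        pooled_row = []
        for col in range(0, len(feature_map[0]) - size + 1, size):
            window = [feature_map[row + i][col + j]
                      for i in range(size) for j in range(size)]
            pooled_row.append(max(window))
        pooled.append(pooled_row)
    return pooled


# Toy 6x6 image: dark left half, bright right half (a vertical edge).
image = [[0.0, 0.0, 0.0, 1.0, 1.0, 1.0] for _ in range(6)]

# Hand-written 3x3 kernel that responds to dark-to-bright vertical edges.
kernel = [[-1.0, 0.0, 1.0],
          [-1.0, 0.0, 1.0],
          [-1.0, 0.0, 1.0]]

feature_map = convolve2d(image, kernel)   # 4x4, strongest response at the edge
pooled = max_pool2d(feature_map)          # 2x2 after downsampling

print(feature_map)
print(pooled)
```

A filter bank is simply many such kernels applied to the same input, each producing its own feature map.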

[Stanford Lecture on CNN](https://www.youtube.com/watch?v=AQirPKrAyDg)

### Architecture - VGG

![VGG Arch](./vgg16.png)




[ResNet Arch]


### Training

VGG and ResNet were both designed for the [ImageNet](imagenet.com) image classification dataset, with the goal of achieving the best possible top-5 accuracy across 1000 possible classes for a given image. The networks were trained for days on GPUs to reach their final weights, and one can imagine that the overall training time during development is more accurately measured in weeks.

**Our problem: We don't have days and weeks to spend training networks (assuming our architecture works)**

What we can do is jump-start our process by using the layers and pre-trained weights from proven architectures like VGG or ResNet, and either adding new layers which we train on our new training data, or selectively training layers in the existing architectures to adjust them to our specific needs. In this fashion, we can cut the training time needed to produce a production-ready architecture down from days to minutes or hours.
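
As a rough sketch of both options in Keras: the function below loads VGG16 without its original classifier, freezes the pre-trained layers, optionally un-freezes the last few, and adds a small new head. The `build_transfer_model` helper, the 224x224 input size, the 256-unit dense layer, and the single sigmoid output are all assumptions for illustration, not necessarily the setup used elsewhere in this notebook. It also assumes a TensorFlow version that ships `tensorflow.keras`.

```python
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.models import Model


def build_transfer_model(weights="imagenet", unfreeze_last=0):
    """Hypothetical helper: VGG16 base plus a small trainable head.

    weights:        "imagenet" downloads pre-trained weights; None gives
                    a randomly initialized base (useful for a dry run).
    unfreeze_last:  number of final base layers to leave trainable.
    """
    # include_top=False drops VGG16's original 1000-class classifier.
    base = VGG16(weights=weights, include_top=False, input_shape=(224, 224, 3))

    # Freeze the pre-trained layers so training does not change them.
    for layer in base.layers:
        layer.trainable = False

    # Optionally un-freeze the last few layers to adapt them to new data.
    if unfreeze_last > 0:
        for layer in base.layers[-unfreeze_last:]:
            layer.trainable = True

    # New layers that will be trained on the new data (assumed sizes).
    x = Flatten()(base.output)
    x = Dense(256, activation="relu")(x)
    outputs = Dense(1, activation="sigmoid")(x)

    model = Model(inputs=base.input, outputs=outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```

Calling `build_transfer_model()` trains only the two new dense layers; `build_transfer_model(unfreeze_last=4)` would also fine-tune the last four layers of the base.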

# Example 1: Cats, Dogs & VGG

In this example, we'll use VGG16 with weights pre-trained on the ImageNet dataset to jump-start our training process. We can selectively un-freeze layers to adjust how much we're customizing the network to our needs, or add new layers that we'll train for our specific problem: identifying cats and dogs.

