### CS231 NOTES

In [6]:
import numpy as np

### Training Neural Networks I

* **Convolution:** Elementwise multiplication followed by a sum. Equivalently, a dot product between the stretched (flattened) input patch and the kernel weights.


* **Input (Convolution) Output Size**
    - (N - F) / stride + 1
    - stride: how many pixels the kernel moves at each step while scanning the image
    - N: input size (N x N image)
    - F: kernel size (F x F)
    
* **Subsampling:** Many people subsample with a larger stride in the convolution rather than with pooling. When max pooling is used, a 2x2 window with stride 2 is the common choice.
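The two views of convolution described above (multiply-and-sum versus dot product of stretched vectors) can be checked on a single patch. This is a minimal sketch; the 3x3 `patch` and `kernel` values are made up for illustration.

```python
import numpy as np

# Hypothetical 3x3 image patch and 3x3 kernel, chosen only for illustration.
patch = np.array([[1., 2., 3.],
                  [4., 5., 6.],
                  [7., 8., 9.]])
kernel = np.array([[1., 0., -1.],
                   [1., 0., -1.],
                   [1., 0., -1.]])

# View 1: elementwise multiplication followed by a sum.
elementwise = np.sum(patch * kernel)

# View 2: dot product of the stretched (flattened) patch and kernel.
dot = np.dot(patch.ravel(), kernel.ravel())

print(elementwise, dot)
```

Both views give the same number for every kernel position; sliding the kernel over the image repeats this computation once per output pixel.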

In [1]:
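# W : input dimension (square)
# F : kernel size (square)
# S : stride
# P : padding (pixels added on each side)

def conv_calc(W, F, S, P):
    """Output size of a convolution layer."""
    return (W - F + 2*P) // S + 1

def pool_calc(W, F, S):
    """Output size of a pooling layer (no padding)."""
    return (W - F) // S + 1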
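A few worked cases of the output-size formula may help. The sketch below uses a hypothetical helper, `conv_output_size`, which additionally rejects strides that do not tile the input evenly; that check is an assumption added for illustration, not something the notes require.

```python
def conv_output_size(W, F, S, P=0):
    """Output width for input W, kernel F, stride S, padding P.

    Hypothetical helper: raises if the kernel does not fit evenly.
    """
    span = W - F + 2 * P
    if span % S != 0:
        raise ValueError(f"stride {S} does not fit: (W - F + 2P) = {span}")
    return span // S + 1

same_size = conv_output_size(32, 5, 1, 2)   # 5x5 kernel, pad 2 keeps 32x32
strided = conv_output_size(7, 3, 2)         # (7 - 3) / 2 + 1
pooled = conv_output_size(32, 2, 2)         # 2x2 pooling with stride 2 halves the size

try:
    conv_output_size(7, 3, 3)               # (7 - 3) / 3 is not an integer
    fits = True
except ValueError:
    fits = False
```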

### Activation Functions

- **Sigmoid: 1 / (1 + exp(-x))**

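Can be interpreted as the firing rate of a neuron. (Good)

Saturates for very negative and very positive inputs, which kills the gradient. (Bad)

Output is not zero-centered, so the gradients on a neuron's weights all share the same sign: in one update the weights can only all increase or all decrease. (Bad)

exp is computationally expensive. (Bad)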

- **Tanh(x):**

Same saturation problem as sigmoid, but the output is zero-centered

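- **ReLU: max(0, x)**

Does not saturate (in the positive region)

Very efficient

Converges much faster than sigmoid/tanh

Biologically more plausible than sigmoid

Not zero-centered

When x < 0 the gradient is zero, so the unit does not activate and its weights are not updated

This can lead to dead ReLUs: units that never activate again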

- **Leaky ReLU: max(0.01x, x)**

Does not saturate

Efficient

Converges fast

Will not die: the gradient is non-zero for x < 0


- **Parametric Rectifier (PReLU): max(alpha * x, x)**

alpha is learned with backprop

- **Exponential Linear Units (ELU)**

All benefits of ReLU

Closer to zero mean outputs

Has a negative saturation regime, unlike Leaky ReLU

This adds some robustness to noise

Computation requires exp

- **Maxout**

Doubles the number of parameters per neuron
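The activation functions listed above are short enough to write out in NumPy. This is a sketch for comparison only: the ELU `alpha=1.0` default and the two-piece form of `maxout` are assumptions, and in PReLU `alpha` would be learned rather than passed in by hand.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, slope=0.01):
    return np.maximum(slope * x, x)

def prelu(x, alpha):
    # Same form as Leaky ReLU; alpha is normally a learned parameter.
    return np.maximum(alpha * x, x)

def elu(x, alpha=1.0):
    # Linear for x > 0, saturates toward -alpha for very negative x.
    return np.where(x > 0, x, alpha * np.expm1(np.minimum(x, 0.0)))

def maxout(x, W1, b1, W2, b2):
    # Max over two linear pieces, hence twice the parameters of one linear unit.
    return np.maximum(x @ W1 + b1, x @ W2 + b2)

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
s = sigmoid(x)
sigmoid_grad = s * (1.0 - s)   # close to 0 at both ends: saturation
print(sigmoid_grad)
print(relu(x), leaky_relu(x), elu(x))
```

The printed sigmoid gradient makes the "kills the gradient" point concrete: it is largest at 0 and almost vanishes at -10 and 10.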





### Data Preprocessing

- For images, zero-center the data (apply the mean computed on the training set to the test set as well)
- Subtract the mean image (a (32, 32, 3) array) - AlexNet
- Subtract the per-channel mean (mean along each channel = 3 numbers) - VGG
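The two zero-centering variants differ only in which axes the mean is taken over. A minimal sketch, using random arrays as a stand-in for real image data (the array names and sizes are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in data: N images of shape (32, 32, 3) with pixel values in [0, 255).
X_train = rng.uniform(0, 255, size=(100, 32, 32, 3))
X_test = rng.uniform(0, 255, size=(20, 32, 32, 3))

# AlexNet-style: subtract the mean image, computed on the training set only.
mean_image = X_train.mean(axis=0)              # shape (32, 32, 3)
X_train_img = X_train - mean_image
X_test_img = X_test - mean_image               # same training mean at test time

# VGG-style: subtract one mean per channel (3 numbers).
channel_mean = X_train.mean(axis=(0, 1, 2))    # shape (3,)
X_train_ch = X_train - channel_mean
X_test_ch = X_test - channel_mean
```

Note that the test set is never used to compute either mean.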

### Weight Initialization

- w = 0: all neurons produce the same output and receive the same gradient, so the symmetry between weights is never broken (Bad)

- All weights drawn from a small random Gaussian: a problem for deeper networks. As we move from the first layer to deeper ones, the activations shrink toward 0.

- Xavier initialization (try this first)

- Initialization becomes more important as the network gets deeper
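The effect of initialization on depth can be seen by pushing random data through a stack of tanh layers and recording the activation std per layer. This is a sketch under assumptions: 10 layers of width 500, tanh nonlinearity, a 0.01 scale for the "small Gaussian" case, and the `1 / sqrt(fan_in)` form of Xavier initialization.

```python
import numpy as np

def forward_stds(init, n_layers=10, width=500, seed=0):
    """Return the std of the activations after each tanh layer."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal((1000, width))
    stds = []
    for _ in range(n_layers):
        W = init(rng, width, width)
        h = np.tanh(h @ W)
        stds.append(h.std())
    return stds

def small_gaussian(rng, fan_in, fan_out):
    return 0.01 * rng.standard_normal((fan_in, fan_out))

def xavier(rng, fan_in, fan_out):
    return rng.standard_normal((fan_in, fan_out)) / np.sqrt(fan_in)

small_stds = forward_stds(small_gaussian)
xavier_stds = forward_stds(xavier)
print("small gaussian:", small_stds[0], "->", small_stds[-1])
print("xavier:        ", xavier_stds[0], "->", xavier_stds[-1])
```

With the small Gaussian the activations collapse toward zero in the deeper layers, while the Xavier-scaled weights keep them in a usable range.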

### Batch Normalization

- For a batch of activations at some layer, normalize by the empirical mean and std of the batch
- Usually inserted after FC and CONV layers

- Normalization matters because it keeps the data on a consistent scale, so the loss function becomes less sensitive to perturbations

- Normalize the outputs of each layer

- Batch normalization is an additional layer that normalizes the data
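A training-time forward pass for batch normalization can be sketched in a few lines. The learnable scale `gamma` and shift `beta`, and the small `eps` for numerical stability, are not mentioned in the notes above; they are included here as the usual formulation and should be read as assumptions.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch axis, then scale and shift."""
    mu = x.mean(axis=0)                      # empirical mean per feature
    var = x.var(axis=0)                      # empirical variance per feature
    x_hat = (x - mu) / np.sqrt(var + eps)    # zero mean, unit std per feature
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 10))   # batch of 64, 10 features
out = batchnorm_forward(x, gamma=np.ones(10), beta=np.zeros(10))
print(out.mean(axis=0).round(6), out.std(axis=0).round(3))
```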

### 1) HYPERPARAMETER SEARCH 
 
- Random search works better than grid search; with a grid, the values tried for one parameter are tied to the values tried for the others

- Coarse-to-fine search: narrow down the search space iteratively

- Typical candidates: lr, model size, regularization, but the choice is mostly problem dependent
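A coarse-to-fine random search could be sketched as below. Sampling the learning rate and regularization strength on a log scale, the specific exponent ranges, and the helper name `sample_config` are all assumptions for illustration; the training and evaluation of each configuration is left out.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_config(rng, lr_exp=(-6, -1), reg_exp=(-5, 0)):
    """Draw one random configuration, sampling exponents uniformly (log scale)."""
    return {
        "lr": 10 ** rng.uniform(*lr_exp),
        "reg": 10 ** rng.uniform(*reg_exp),
    }

# Coarse stage: wide ranges, a few short training runs per configuration.
coarse = [sample_config(rng) for _ in range(20)]

# Fine stage: narrower ranges around whatever region looked best in the
# coarse stage (the ranges here are hypothetical).
fine = [sample_config(rng, lr_exp=(-4, -3), reg_exp=(-4, -2)) for _ in range(20)]
```

Because every draw is independent, each of the 20 runs tries a distinct value for every hyperparameter, unlike a grid.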

### Training Neural Networks II

### Fancier Optimization

### Problems with SGD

- Progress can be very slow along shallow dimensions while jittering along steep ones (taco shell)

- This happens when the loss function has a high condition number: the ratio of the largest to the smallest singular value of the Hessian matrix

- The problem gets worse as the dimension grows, since the condition number becomes even larger

- **Saddle Points**: SGD gets stuck because the gradient is zero. Saddle points become much more frequent as the dimension increases. The optimizer may fail to escape a saddle point, or may settle in a local minimum.

### SGD + Momentum

- Build up velocity as a running mean of gradients
- Rho gives "friction", typically 0.9 or 0.99
- Helps with condition number problem
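The momentum update can be tried on a toy quadratic that is shallow along one axis and steep along the other. This is a sketch: the loss `0.5 * (w[0]**2 + 50 * w[1]**2)`, the learning rate, and the step count are assumptions picked for illustration.

```python
import numpy as np

CURVATURE = np.array([1.0, 50.0])   # shallow along w[0], steep along w[1]

def grad(w):
    # Gradient of f(w) = 0.5 * (w[0]**2 + 50 * w[1]**2)
    return CURVATURE * w

def plain_sgd(w, lr=0.01, steps=200):
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

def sgd_momentum(w, lr=0.01, rho=0.9, steps=200):
    v = np.zeros_like(w)
    for _ in range(steps):
        v = rho * v + grad(w)   # velocity: decaying accumulation of gradients
        w = w - lr * v          # step along the velocity, not the raw gradient
    return w

w0 = np.array([1.0, 1.0])
w_sgd = plain_sgd(w0)
w_mom = sgd_momentum(w0)
print("plain SGD:", w_sgd, " momentum:", w_mom)
```

With the same learning rate, plain SGD is still far from zero along the shallow axis after 200 steps, while the momentum version has built up enough velocity to get much closer.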

### Nesterov Momentum 

- Take a step by velocity
- Compute gradient at that point
- Take a step in gradient direction
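The three steps above map directly onto code. A minimal sketch on the same kind of toy quadratic (the loss, learning rate, and step count are assumptions for illustration):

```python
import numpy as np

CURVATURE = np.array([1.0, 50.0])

def grad(w):
    # Gradient of f(w) = 0.5 * (w[0]**2 + 50 * w[1]**2)
    return CURVATURE * w

def nesterov(w, lr=0.01, rho=0.9, steps=200):
    v = np.zeros_like(w)
    for _ in range(steps):
        lookahead = w + rho * v        # 1) take a step by the velocity
        g = grad(lookahead)            # 2) compute the gradient at that point
        v = rho * v - lr * g           # 3) correct the velocity with that gradient
        w = w + v
    return w

w_nag = nesterov(np.array([1.0, 1.0]))
print(w_nag)
```

The only difference from plain momentum is where the gradient is evaluated: at the look-ahead point rather than at the current parameters.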

### AdaGrad

- Keep a running sum of squared gradients
- In the update, divide by the square root of this term
- With a very high condition number, this rescales the gradient per dimension
- The step size always shrinks over training; RMSProp is better in this respect
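The difference between AdaGrad and RMSProp shows up even with a constant gradient: AdaGrad's accumulated sum keeps growing, so its steps keep shrinking, while RMSProp's decaying average levels off. A sketch with assumed hyperparameters (`lr=0.1`, `decay=0.9`):

```python
import numpy as np

def adagrad_step(w, g, cache, lr=0.1, eps=1e-8):
    cache = cache + g * g                           # running sum of squared gradients
    return w - lr * g / (np.sqrt(cache) + eps), cache

def rmsprop_step(w, g, cache, lr=0.1, decay=0.9, eps=1e-8):
    cache = decay * cache + (1 - decay) * g * g     # decaying average instead of a sum
    return w - lr * g / (np.sqrt(cache) + eps), cache

w_a, cache_a = 0.0, 0.0
w_r, cache_r = 0.0, 0.0
adagrad_steps, rmsprop_steps = [], []
for _ in range(100):
    g = 1.0                                         # constant gradient for illustration
    new_w, cache_a = adagrad_step(w_a, g, cache_a)
    adagrad_steps.append(abs(new_w - w_a))
    w_a = new_w
    new_w, cache_r = rmsprop_step(w_r, g, cache_r)
    rmsprop_steps.append(abs(new_w - w_r))
    w_r = new_w

print("AdaGrad first/last step:", adagrad_steps[0], adagrad_steps[-1])
print("RMSProp first/last step:", rmsprop_steps[0], rmsprop_steps[-1])
```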

### Adam

- Sort of like RMSProp with momentum
- beta1 = 0.9
- beta2 = 0.999
- lr = 1e-3 to 1e-4 is a good starting range
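Adam combines a momentum-style first moment with an RMSProp-style second moment. The sketch below uses the commonly cited defaults beta1 = 0.9 and beta2 = 0.999 and includes bias correction; the toy quadratic, the larger `lr=0.05`, and the step count are assumptions chosen so the example converges quickly.

```python
import numpy as np

def adam(grad_fn, w, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    m = np.zeros_like(w)   # first moment (momentum-like)
    v = np.zeros_like(w)   # second moment (RMSProp-like)
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)   # bias correction for the zero init
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

def grad(w):
    # Gradient of f(w) = 0.5 * (w[0]**2 + 50 * w[1]**2)
    return np.array([1.0, 50.0]) * w

w_adam = adam(grad, np.array([1.0, 1.0]), lr=0.05, steps=500)
print(w_adam)
```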

We don't have to keep the same learning rate; we can change it during training. Learning rate decay is common with SGD + Momentum.

First choose a good learning rate and check the loss curve, then decide where learning rate decay might be needed.
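Some common decay schedules, written as functions of the epoch. The notes do not name specific schedules, so the three forms below (step, exponential, 1/t) and their constants are assumptions for illustration.

```python
import math

def step_decay(lr0, epoch, drop=0.5, every=10):
    """Multiply the lr by `drop` every `every` epochs."""
    return lr0 * drop ** (epoch // every)

def exponential_decay(lr0, epoch, k=0.05):
    return lr0 * math.exp(-k * epoch)

def inverse_decay(lr0, epoch, k=0.05):
    return lr0 / (1 + k * epoch)

for epoch in (0, 10, 25):
    print(epoch,
          step_decay(0.1, epoch),
          exponential_decay(0.1, epoch),
          inverse_decay(0.1, epoch))
```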

**Second-Order Optimization: Newton parameter update**

Bad for deep learning: the Hessian has O(N**2) elements and inverting it is O(N**3)


### LBFGS

- If you can afford full-batch updates, try it out (and remove all sources of noise, since this method does not work well in the stochastic, mini-batch setting)

### 2) CREATING BETTER MODELS

#### A) MODEL ENSEMBLES

- Train independent models
- At test time average their results (about +2% performance)
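Averaging at test time usually means averaging the predicted class probabilities. A small sketch with made-up softmax outputs from three models (the numbers are hypothetical):

```python
import numpy as np

# Hypothetical softmax outputs of three independently trained models,
# for 2 test examples and 3 classes: shape (models, examples, classes).
probs = np.array([
    [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]],
    [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]],
    [[0.3, 0.6, 0.1], [0.2, 0.2, 0.6]],
])

avg_probs = probs.mean(axis=0)          # average over the model axis
ensemble_pred = avg_probs.argmax(axis=1)
print(avg_probs, ensemble_pred)
```

The individual models disagree on both examples, and the averaged probabilities settle the vote.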

#### B) SNAPSHOT ENSEMBLE

- Use cyclic learning rate schedules
- Take snapshots at different minima and average their predictions
- Requires training the model only once

#### C) POLYAK AVERAGING

- Keep a moving average of the parameter vector and use it at test time
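Polyak averaging can be sketched as an exponential moving average of the parameters. Here the noisy training iterates are simulated by adding Gaussian noise around a known optimum `w_true`; that stand-in, the `decay=0.995` value, and the helper name `polyak_update` are assumptions for illustration.

```python
import numpy as np

def polyak_update(w_avg, w, decay=0.995):
    """Exponential moving average of the parameter vector."""
    return decay * w_avg + (1 - decay) * w

rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0])     # pretend optimum
w_avg = np.zeros(2)

for _ in range(2000):
    # Stand-in for one noisy SGD iterate hovering around the optimum.
    w = w_true + 0.5 * rng.standard_normal(2)
    w_avg = polyak_update(w_avg, w)

print("last iterate error:", np.linalg.norm(w - w_true))
print("averaged error:    ", np.linalg.norm(w_avg - w_true))
```

The averaged vector is what would be used at test time; it smooths out the step-to-step noise of the raw iterates.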

#### D) REGULARIZATION

- L2 
- L1
- Elastic Net

- For neural networks:

#### DROPOUT

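- Dropout: zero out each activation with some probability during training, so it does not affect the next layer
- More common in FC layers but may also be used in CONV
- Dropout prevents co-adaptation of features, so the model generalizes better
- It's like a gigantic ensemble within a single model
- At test time, multiply activations by the probability of keeping a unit; with inverted dropout (scaling by 1/p at train time) no test-time scaling is needed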

#### ADD NOISE

- Training: Add some randomness
- Test: Average out randomness

#### BATCH NORMALIZATION

- Training: Normalize using stats from random minibatches
- Testing: Use fixed stats to normalize

#### DATA AUGMENTATION

- Training: Sample random flips, crops, ...
- Test: Average 5 augmented predictions
- You can go crazy with this, mixed combinations without changing the label

#### DROP CONNECT

- Rather than zeroing out activations, zero out weights in the weight matrix

#### FRACTIONAL MAXPOOLING

- Training: Add random noise
- Test: Marginalize over the noise
- Every pooling layer will have random regions

#### STOCHASTIC DEPTH

- Training: Randomly dropping layers
- Test: Use whole network

In [9]:
### Regularization with dropout (inverted dropout)
p = 0.5 # probability of keeping a unit active (example value)

def train_step(X):
    '''X contains data; W1, b1, W2, b2, W3, b3 are assumed to be defined elsewhere'''
    
    # Forward pass example for 3-layer neural network
    H1 = np.maximum(0, np.dot(W1, X) + b1) # first activation with ReLU
    U1 = (np.random.rand(*H1.shape) < p) / p # first dropout mask, scaled by 1/p
    H1 *= U1 # drop
    H2 = np.maximum(0, np.dot(W2, H1) + b2)
    U2 = (np.random.rand(*H2.shape) < p) / p # second dropout mask, scaled by 1/p
    H2 *= U2 # drop
    out = np.dot(W3, H2) + b3
    return out
    
def predict(X):
    # ensembled forward pass
    H1 = np.maximum(0, np.dot(W1, X) + b1) # no scaling necessary
    H2 = np.maximum(0, np.dot(W2, H1) + b2)
    out = np.dot(W3, H2) + b3
    return out

### 3) TRANSFER LEARNING

- Take pretrained network
- Change last FC layer (or n last FC layers)
- Good with small data
- If you have enough data, you can fine-tune the whole FC part of the network after updating the last layer (freeze the CONV layers)

**GUIDE TO DEAL WITH CASES BY PRETRAINED NETS:**

- **very little data & very similar dataset (to ImageNet)**
    - Use linear classifier on top layer
    - Basically just update the last FC layer 
- **a lot of data & very similar dataset**
    - Fine-tune a few layers (FC layers, with convs frozen)
- **a lot of data & very different dataset**
    - Fine-tune a larger number of layers after updating the last layer (may include conv layers, using differential learning rates)
- **very little data & very different dataset**
    - Be more creative, since replacing only the last layer won't help in this case. Re-initialize more layers and experiment. (CT scans, MRIs)
    
    
Download a pretrained model close to your task, then update the last layer or fine-tune.
    
    

### 4) FASTAI TRICKS

**1)** lr_find(): Increases the lr until the loss starts to increase, so that we can find a learning rate that converges fast enough without jumping over a minimum.

**2)** Stochastic Gradient Descent with Restarts (SGDR): Decreases the lr gradually to converge slowly as we approach a minimum, then resets it to the starting level to escape local minima and find a better minimum.
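The SGDR schedule can be sketched as a function of the training step. The cosine shape of the decay, the fixed cycle length, and the name `sgdr_lr` are assumptions for illustration; this is not fastai's implementation.

```python
import math

def sgdr_lr(step, lr_max=0.1, lr_min=0.0, cycle_len=100):
    """Cosine-annealed lr that jumps back to lr_max at the start of each cycle."""
    pos = (step % cycle_len) / cycle_len          # position within the cycle, in [0, 1)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * pos))

schedule = [sgdr_lr(s) for s in range(300)]       # three cycles of 100 steps
print(schedule[0], schedule[50], schedule[99], schedule[100])
```

Within a cycle the lr falls smoothly from `lr_max` toward `lr_min`; at step 100 it restarts at `lr_max`. Saving the weights at the end of each cycle is what a snapshot ensemble would average over.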