## Review
* What's overfitting?
    * Regularization
* Normalization
    * Use the same STD and Mean for Test set
* MSE and MAE?
* Softmax?
* Bias vs. Variance trade-off
* Exponentially Weighted Moving Average [(read more)](https://medium.com/datadriveninvestor/exponentially-weighted-average-for-deep-neural-networks-39873b8230e9)
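
The normalization bullet above can be sketched with NumPy. This is a minimal illustration on synthetic data (the array names and sizes are assumptions, not part of the original notes): the mean and standard deviation are computed on the training set only and then reused for the test set.

```python
import numpy as np

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X_test = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

# Statistics come from the training set only
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

X_train_norm = (X_train - mean) / std
# The test set reuses the training mean and std
X_test_norm = (X_test - mean) / std

print(X_train_norm.mean(axis=0), X_train_norm.std(axis=0))
```

Computing separate statistics on the test set would leak test information and put the two sets on different scales.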

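**Assuming** human error (optimal / Bayes error) ~ 0%:

| Error     |                             |                          |                         |                        |
|-----------|-----------------------------|--------------------------|-------------------------|------------------------|
| Train     | 1 %                         | 15 %                     | 15 %                    | 0.5 %                  |
| Test      | 11 %                        | 16 %                     | 30 %                    | 1 %                    |
| Diagnosis | High Variance (Overfitting) | High Bias (Underfitting) | High Bias, High Variance | Low Bias, Low Variance |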

* What if optimal error is 15%?
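
One way to reason about the question above is to compare the training error against the optimal error (avoidable bias) and the test error against the training error (variance). The sketch below is only a rule of thumb: the function name `diagnose` and the threshold `tol` are hypothetical choices made for illustration.

```python
def diagnose(train_err, test_err, bayes_err=0.0, tol=0.05):
    """Label a model from its error rates (all given as fractions).

    `tol` is an arbitrary threshold chosen for this illustration.
    """
    avoidable_bias = train_err - bayes_err   # gap to the optimal error
    variance = test_err - train_err          # gap between train and test
    labels = [
        "high bias" if avoidable_bias > tol else "low bias",
        "high variance" if variance > tol else "low variance",
    ]
    return avoidable_bias, variance, labels

# Same errors, different optimal error
print(diagnose(0.15, 0.16, bayes_err=0.0)[2])
print(diagnose(0.15, 0.16, bayes_err=0.15)[2])
```

With an optimal error of 15%, a training error of 15% no longer indicates underfitting, because there is no avoidable bias left.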

## Initializing Weights

* Which parameters need initialization?
* Can we initialize the weights of a neural network to `zero`? What will happen? (Symmetry breaking problem)
    * The forward and backward passes of two neurons in the same layer will be identical
    * If both are identical, they compute the same function!
* Why can't we initialize the weight matrices however we like?
    * Vanishing / exploding gradients
* Types of random initialization
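
The symmetry problem above can be checked numerically. The following is a minimal sketch on a tiny 3-4-1 tanh network without biases (the network shape, the data, and the helper name `hidden_grads` are assumptions for illustration). When every weight starts at the same value, every hidden neuron receives the same gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))   # 8 samples, 3 features
y = rng.normal(size=(8, 1))

def hidden_grads(W1, W2):
    """Gradient of the MSE loss w.r.t. W1; one column per hidden neuron."""
    H = np.tanh(X @ W1)                    # hidden activations, shape (8, 4)
    y_hat = H @ W2                         # predictions, shape (8, 1)
    d_out = 2 * (y_hat - y) / len(X)       # dLoss/dy_hat
    d_H = (d_out @ W2.T) * (1 - H ** 2)    # backprop through tanh
    return X.T @ d_H                       # shape (3, 4)

grad_zero = hidden_grads(np.zeros((3, 4)), np.zeros((4, 1)))
grad_const = hidden_grads(np.full((3, 4), 0.5), np.full((4, 1), 0.5))
grad_rand = hidden_grads(rng.normal(size=(3, 4)) * 0.01,
                         rng.normal(size=(4, 1)) * 0.01)

# Identical columns mean the neurons never become different
print(np.allclose(grad_const, grad_const[:, :1]))
print(np.allclose(grad_rand, grad_rand[:, :1]))
```

In this particular network the all-zero case is even worse: the gradient with respect to the first layer is exactly zero, so those weights do not move at all.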
   

In [None]:
import numpy as np
import torch.nn as nn

In [None]:
# Random Weight Initialization
# Why 0.01? Large weights saturate sigmoid/tanh -> small gradients -> slow learning
w = np.random.randn(2, 2) * 0.01  # randn takes the dimensions as separate arguments

In [None]:
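# Glorot / Xavier Normal and Uniform
F_in = 64   # Fan in
F_out = 32  # Fan out

# Normal: standard deviation sqrt(2 / (fan_in + fan_out))
std = np.sqrt(2 / float(F_in + F_out))
W = np.random.normal(0., std, size = (F_in, F_out))

# Uniform: samples from [-limit_uni, limit_uni], limit_uni = sqrt(6 / (fan_in + fan_out))
limit_uni = np.sqrt(6 / float(F_in + F_out))
W_uni = np.random.uniform(low = -limit_uni, high = limit_uni, size = (F_in, F_out))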

In [None]:
# He et al. / Kaiming Uniform and Normal
# Deep networks where ReLU-like activation functions are in use

F_in = 64
F_out = 32
# Uniform: samples from [-limit, limit], limit = sqrt(6 / fan_in)
limit = np.sqrt(6 / float(F_in))
W = np.random.uniform(low = -limit, high = limit, size = (F_in, F_out))

# Normal: standard deviation sqrt(2 / fan_in)
std_norm = np.sqrt(2 / float(F_in))
W_norm = np.random.normal(0., std_norm, size = (F_in, F_out))
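
To see why the scale of the initial weights matters, the sketch below pushes random inputs through a stack of ReLU layers and reports the spread of the final activations. The depth, width, batch size, and the helper name `final_activation_std` are assumptions chosen for illustration.

```python
import numpy as np

def final_activation_std(init, n_layers=10, width=256, seed=0):
    """Std of the activations after `n_layers` ReLU layers (no biases)."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(512, width))        # a batch of random inputs
    for _ in range(n_layers):
        if init == "small":
            W = rng.normal(0.0, 0.01, size=(width, width))
        elif init == "he":
            W = rng.normal(0.0, np.sqrt(2.0 / width), size=(width, width))
        else:
            raise ValueError(f"unknown init: {init}")
        a = np.maximum(0.0, a @ W)           # ReLU
    return a.std()

print("small:", final_activation_std("small"))
print("he:   ", final_activation_std("he"))
```

With the fixed 0.01 scale the signal shrinks at every layer and is close to zero after ten layers, whereas the He scaling keeps the activations at roughly the same magnitude.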

## Regularization

    “Many strategies used in machine learning are explicitly designed to reduce the test error, possibly at the expense of increased training error. These strategies are collectively known as regularization.” – Goodfellow et al.

While the loss function measures how well or poorly we are doing on a given task, and *gradient descent* determines how the weight parameters are updated, neither says anything about the `look` of the weight matrix. There is an almost *infinite* number of parameter settings that achieve virtually the same accuracy.


So, how do we choose a set of *parameters* so that our model **generalizes** well? In other words, how do we reduce the amount of overfitting?



### Why Regularization?
* Generalization: Making sure that our models can better handle unseen data.
* Risks of regularization: Too much regularization leads to *underfitting*, in which the model cannot pick up the relationship between input and output and performs poorly.



### Why does it work?

* Simplifies the network (kind of wrong analogy)
* Reduces the effect of neurons

### Types of Regularizations

* L1 Regularization (Lasso)
* L2 Regularization (Ridge / Weight Decay)
* Elastic Net
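
The three penalties above can be written as terms added to the loss. A minimal sketch, assuming a regularization strength `lam` and an elastic-net mixing factor `alpha` (both names are hypothetical):

```python
import numpy as np

def l1_penalty(W, lam):
    """L1: sum of absolute weights; pushes weights to exactly zero."""
    return lam * np.abs(W).sum()

def l2_penalty(W, lam):
    """L2: sum of squared weights; shrinks all weights toward zero."""
    return lam * np.square(W).sum()

def elastic_net_penalty(W, lam, alpha=0.5):
    """Elastic net: a weighted mix of the L1 and L2 penalties."""
    return alpha * l1_penalty(W, lam) + (1 - alpha) * l2_penalty(W, lam)

W = np.array([[1.0, -2.0], [0.0, 3.0]])
print(l1_penalty(W, 0.1), l2_penalty(W, 0.1), elastic_net_penalty(W, 0.1))
```

The total objective is then the data loss plus one of these terms. The gradient of the L2 term is `2 * lam * W`, which is why it shows up as a decay of the weights at every update.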

Regularization that can be **explicitly** added to the network, such as **dropout**

* Dropout
    * Trains small parts of the network
    * Higher drop-rate for complex layers (layers with too many neurons)
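
Dropout can be sketched in a few lines of NumPy. The version below is the common "inverted dropout" variant, which is an assumption here since the notes do not say which variant is meant; the function name and arguments are hypothetical.

```python
import numpy as np

def dropout(a, drop_rate, rng, training=True):
    """Inverted dropout: zero out units and rescale the survivors."""
    if not training or drop_rate == 0.0:
        return a                              # no-op at test time
    keep_prob = 1.0 - drop_rate
    mask = rng.random(a.shape) < keep_prob    # True for units that are kept
    return a * mask / keep_prob               # rescale to keep the expected value

rng = np.random.default_rng(0)
activations = np.ones((1000, 100))
dropped = dropout(activations, drop_rate=0.5, rng=rng)
print(dropped.mean())
```

Each forward pass samples a new mask, so every mini-batch effectively trains a different thinned sub-network, and the rescaling keeps the expected activation unchanged between training and test time.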

Or **implicit** ones that are applied *during* the training process, e.g. **augmentation** and **early stopping**
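
Early stopping can be sketched as a scan over the validation losses recorded after each epoch. The helper name `early_stopping_epoch` and the `patience` parameter are assumptions for illustration, and the losses are made-up numbers.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the index of the epoch whose weights would be kept."""
    best_loss = float("inf")
    best_epoch = -1
    waited = 0                      # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:  # no improvement for `patience` epochs
                break
    return best_epoch

# Made-up validation losses: improvement stalls after epoch 2
losses = [1.0, 0.8, 0.6, 0.65, 0.7, 0.72, 0.5]
print(early_stopping_epoch(losses, patience=3))
```

A small `patience` stops training sooner but can miss a later improvement, as the last value in the example shows.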

# CNNs

* Traditional NN: We used Fully-Connected layers
    * FC: Each neuron in layer *`i`* is connected to every neuron in layer *`i - 1`*
* Convolutional NN: At least one of those FC layers is replaced with a *Convolutional* layer

### Why convolutions?

1. Sparse Interactions
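    1. Each output connects to only a small number of inputs, yet still extracts meaningful structure from the data (with m inputs and n outputs, a dense layer needs m x n parameters; limiting each output to k connections needs only k x n)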
    
    <img src="./img/sparse_connection_cnn.png" width = "400" height = "350">
    Source: Goodfellow et al., *Deep Learning*

2. Parameter Sharing
    1. Weights, in kernels, are shared among all inputs  (**bold** arrows demonstrate shared params)
    
    <img src= "./img/param_sharing_cnn.png" width = '400' height = '350'>

3. Equivariant Representations
    1. If the input is translated, the layer's output is translated in the same way
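
Equivariance to translation can be checked numerically: shifting the input shifts the output by the same amount. The sketch below uses a 1D signal and `np.correlate` for brevity; the signal, the kernel, and the shift are arbitrary values chosen for illustration.

```python
import numpy as np

x = np.zeros(20)
x[5:8] = [1.0, 2.0, 3.0]            # a small "object" in the signal
kernel = np.array([1.0, 0.0, -1.0])

shift = 2
x_shifted = np.roll(x, shift)       # same object, moved 2 steps to the right

y = np.correlate(x, kernel, mode="valid")
y_shifted = np.correlate(x_shifted, kernel, mode="valid")

# The response to the shifted input is the shifted response
print(np.array_equal(y_shifted, np.roll(y, shift)))
```

The same kernel weights are applied at every position, which is why the response simply moves with the object.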

### What's convolution?

#### Kernels


* Tiny matrices (compared to the original image) that are slid over the big matrix

* Why is the size odd? (3x3, 5x5)
    * So that each kernel has a well-defined center (origin).

<img src='./img/kernels_odd_cnn.png' width = '400' height = '300'>

* Element-wise matrix multiplication, followed by a sum of the resulting elements!
<img src='./img/3D_Convolution_Animation.gif' alt='https://commons.wikimedia.org/wiki/File:3D_Convolution_Animation.gif'>
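
The sliding multiply-and-sum described above can be written as a naive double loop. This is a minimal sketch for a single-channel image with no padding and stride 1 (the function name `conv2d_valid` is hypothetical). Like most deep learning libraries, it does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image`; multiply element-wise and sum at each position."""
    n_h, n_w = image.shape
    f_h, f_w = kernel.shape
    out = np.zeros((n_h - f_h + 1, n_w - f_w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + f_h, j:j + f_w]   # region under the kernel
            out[i, j] = np.sum(patch * kernel)
    return out

image = np.arange(16.0).reshape(4, 4)
kernel = np.ones((3, 3))
print(conv2d_valid(image, kernel))
# [[45. 54.]
#  [81. 90.]]
```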

#### Output Image Dimensions

[(visualization)](https://github.com/vdumoulin/conv_arithmetic)
   
    Image Size : N x N

    Kernel / Filter Size : F x F

    Output = (N - F + 1) x (N - F + 1)

##### Padding

* What's Padding?
    * Add zeros to borders
* Why padding?
    * Image shrinkage
    * Less data on corners
* Valid and Same Convolutions
    * Valid: **No** Padding
    * Same: **Same** size as input
* What's the output size if we have padding of size `P`?

        (N + 2P - F + 1) x (N + 2P - F + 1)
    * Why 2P? Padding is added on both sides of each dimension.
* Padding formula (for same output dim, with stride 1 and odd F)

        P = ( F - 1 ) / 2

##### Stride

* What happens to the output size if we have stride? (S = stride)

$$ \left\lfloor \frac{N + 2P - F}{S} \right\rfloor + 1 $$
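
The output-size formulas from this section can be collected into two small helpers. The function names are hypothetical; the arithmetic follows the formulas above, with floor division playing the role of the floor brackets.

```python
def conv_output_size(n, f, p=0, s=1):
    """Output side length for an n x n input, f x f kernel, padding p, stride s."""
    return (n + 2 * p - f) // s + 1

def same_padding(f):
    """Padding that keeps the output size equal to the input (stride 1, odd f)."""
    return (f - 1) // 2

print(conv_output_size(6, 3))                       # valid convolution: 4
print(conv_output_size(6, 3, p=same_padding(3)))    # same convolution: 6
print(conv_output_size(7, 3, s=2))                  # stride 2: 3
```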


* Convolution over volume

        Image N x N x C (channels)
        Filter F x F x C

If one layer of a neural network has 10 filters, each of size (3 x 3 x 3), how many parameters does it have?

    3 x 3 x 3 = 27

    27 x 10 = 270

    270 + 10 = 280 parameters
                (each filter has a bias)
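
The same count can be written as a small helper (the name `conv_layer_params` is hypothetical), which makes it easy to try other kernel sizes and filter counts:

```python
def conv_layer_params(f, c_in, n_filters, bias=True):
    """Number of learnable parameters in one convolutional layer."""
    per_filter = f * f * c_in + (1 if bias else 0)   # weights plus optional bias
    return per_filter * n_filters

print(conv_layer_params(f=3, c_in=3, n_filters=10))  # 280
```

Note that the count does not depend on the size of the input image, which is a direct consequence of parameter sharing.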

### Convolution in DL

Most of the kernels we have used so far were hand-crafted for a specific image processing task, but now we can use machine learning to learn the values of the kernels *automatically!*

### Types of Layers

* Convolutional (CONV)
* Activation (ACT, or RELU)
* Pooling (POOL)
* Fully-Connected (FC)
* BatchNorm (BN)
* Dropout (DO)

# References

1. Deep Learning, Ian Goodfellow, Yoshua Bengio, and Aaron Courville
2. Deep Learning for Computer Vision, Adrian
3. Coursera Deep Learning, Andrew Ng.