# Setting up the data and the model

## TL;DR

- The recommended preprocessing is to center the data to have mean of zero, and normalize its scale to [-1, 1] along each feature
- Initialize the weights by drawing them from a Gaussian distribution with standard deviation $\sqrt{2/n}$, where $n$ is the number of inputs to the neuron.
  - E.g. in numpy: `w = np.random.randn(n) * np.sqrt(2.0/n)`.
- Use L2 regularization and dropout (the inverted version)
- Use batch normalization



## Data Preprocessing

Data matrix: `X`, size `[N x D]`

- `N`: number of data points
- `D`: dimensionality of the data

Three common forms of preprocessing for a data matrix `X`:

- Mean subtraction
- Normalization
- PCA and whitening

### **Mean subtraction**

- subtracting the mean across every individual *feature* in the data

- geometric interpretation of *centering the cloud of data around the origin* along every dimension

- Numpy implementation:

  ```python
  X -= np.mean(X, axis=0)
  ```

### **Normalization**

- normalizing the data dimensions so that they are of approximately the *same scale*

- Two ways:

  - divide each dimension by its standard deviation, once it has been zero-centered

    ~~~python
    X /= np.std(X, axis=0)
    ~~~

  - normalize each dimension so that the min and max along the dimension are -1 and 1 respectively

- Makes sense to apply this preprocessing if you have a reason to believe that different input features have different scales (or units), but they should be of approximately equal importance to the learning algorithm.

- In case of images, the relative scales of pixels are already approximately equal (and in range from 0 to 255)

  $\rightarrow$ It is not strictly necessary to perform this additional preprocessing step.
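
The two normalization options can be sketched side by side. This is a minimal illustration on made-up data; the names `X_std` and `X_minmax` are introduced here and are not part of the original notes, and the sketch assumes no feature is constant (otherwise the divisions would be by zero).

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 5 samples, 2 features on very different scales (hypothetical values)
X = rng.normal(loc=[10.0, 1000.0], scale=[1.0, 200.0], size=(5, 2))

# Option 1: zero-center, then divide by the per-feature standard deviation
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Option 2: rescale each feature linearly so its min maps to -1 and its max to 1
x_min, x_max = X.min(axis=0), X.max(axis=0)
X_minmax = 2.0 * (X - x_min) / (x_max - x_min) - 1.0

print(X_std.std(axis=0))                         # unit standard deviation per feature
print(X_minmax.min(axis=0), X_minmax.max(axis=0))
```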

<img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/prepro1.jpeg" alt="img" style="zoom:50%;" />

### **PCA and Whitening**

- The data is first centered as described above.

- Then compute the covariance matrix that tells us about the correlation structure in the data

  ~~~python
  # Assume input data matrix X of size [N x D]
  X -= np.mean(X, axis = 0) # zero-center the data (important)
  cov = np.dot(X.T, X) / X.shape[0] # get the data covariance matrix
  ~~~

  - The $(i, j)$ element of the data covariance matrix contains the *covariance* between i-th and j-th dimension of the data
  - The diagonal of this matrix contains the variances
  - The covariance matrix is symmetric and [positive semi-definite](http://en.wikipedia.org/wiki/Positive-definite_matrix#Negative-definite.2C_semidefinite_and_indefinite_matrices)

- We can compute the SVD factorization of the data covariance matrix

  ~~~python
  U,S,V = np.linalg.svd(cov)
  ~~~

  - Columns of `U`: eigenvectors 
    - orthonormal vectors (norm of 1, orthogonal to each other)
    - can be regarded as basis vectors
  - `S`: 1-D array of the singular values

- We project the original (but zero-centered) data into the eigenbasis

  ~~~python
  Xrot = np.dot(X, U)
  ~~~

  - The covariance matrix of `Xrot` is now diagonal, i.e. the data has been decorrelated

- `np.linalg.svd` returns `U` with its eigenvector columns sorted by eigenvalue, largest first

  $\rightarrow$ We can use this to reduce the dimensionality of the data by only using the top few eigenvectors, and discarding the dimensions along which the data has no variance.

  - also referred to as [Principal Component Analysis (PCA)](http://en.wikipedia.org/wiki/Principal_component_analysis) dimensionality reduction

  ~~~python
  Xrot_reduced = np.dot(X, U[:,:100]) # Xrot_reduced becomes [N x 100]
  ~~~

  - 👆 reduces the original dataset of size [N x D] to one of size [N x 100], keeping the 100 dimensions of the data that contain the *most* variance
  - can get very good performance by training linear classifiers or neural networks on the PCA-reduced datasets, obtaining savings in both space and time. 👏
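
The PCA steps above can be put together in one small, self-contained sketch. The toy data, the mixing matrix `A`, and the choice `k = 2` are assumptions made purely for illustration. The check at the end confirms that the covariance of the rotated data is diagonal, with the values of `S` on the diagonal.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy correlated data: N=500 points in D=3 dimensions (hypothetical mixing matrix)
A = np.array([[2.0, 0.0, 0.0],
              [1.0, 0.5, 0.0],
              [0.0, 0.3, 0.1]])
X = rng.standard_normal((500, 3)) @ A

X = X - X.mean(axis=0)              # zero-center the data
cov = X.T @ X / X.shape[0]          # [D x D] covariance matrix
U, S, Vt = np.linalg.svd(cov)       # S comes back sorted in descending order

Xrot = X @ U                        # rotate into the eigenbasis (decorrelates the data)
cov_rot = Xrot.T @ Xrot / Xrot.shape[0]

k = 2                               # number of components to keep (arbitrary here)
Xrot_reduced = X @ U[:, :k]         # [N x k]

print(np.round(cov_rot, 6))         # off-diagonal entries are numerically zero
print(Xrot_reduced.shape)
```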

#### Whitening

- takes the data in the eigenbasis
- divides every dimension by the square root of its eigenvalue to normalize the scale
- Geometric interpretation: if the input data is a multivariate Gaussian, then the whitened data will be a Gaussian with zero mean and identity covariance matrix

~~~python
# whiten the data:
# divide by the square roots of the eigenvalues
# (for the covariance matrix, the singular values in S are its eigenvalues)
Xwhite = Xrot / np.sqrt(S + 1e-5)
~~~

- `1e-5`: prevent division by zero
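
As a quick sanity check of the geometric interpretation, the sketch below whitens a made-up correlated 2-D dataset and verifies that the resulting covariance is close to the identity. The data and the name `cov_white` are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical correlated 2-D Gaussian data
X = rng.standard_normal((1000, 2)) @ np.array([[3.0, 0.0], [1.5, 0.5]])

X = X - X.mean(axis=0)              # zero-center
cov = X.T @ X / X.shape[0]
U, S, _ = np.linalg.svd(cov)

eps = 1e-5                          # guards against division by zero
Xwhite = (X @ U) / np.sqrt(S + eps)

cov_white = Xwhite.T @ Xwhite / Xwhite.shape[0]
print(np.round(cov_white, 3))       # approximately the 2x2 identity matrix
```

The small `eps` means the diagonal is very slightly below 1; a larger value would act as stronger smoothing at the cost of a less exact whitening.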

<img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/prepro2.jpeg" alt="img" style="zoom:50%;" />



### Summary

In practice, PCA/Whitening are NOT used with Convolutional Networks. 😭

However, it is very important to zero-center the data, and it is common to see normalization of every pixel as well.

**Note**: 

An important point to make about the preprocessing is that any preprocessing statistics (e.g. the data mean) must **only be computed on the training data**, and then applied to the validation / test data. 

I.e.: the mean must be computed only over the training data and then subtracted equally from all splits (train/val/test).
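
A minimal sketch of this rule, assuming a made-up dataset and a simple 60/20/20 split (the split sizes and the `_p` suffix for preprocessed arrays are illustrative choices, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 4))   # hypothetical dataset
X_train, X_val, X_test = X[:60], X[60:80], X[80:]

# Statistics come from the training split only ...
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

# ... and the same transform is then applied to every split
X_train_p = (X_train - mean) / std
X_val_p = (X_val - mean) / std
X_test_p = (X_test - mean) / std

print(X_train_p.mean(axis=0))   # numerically zero
print(X_val_p.mean(axis=0))     # close to zero, but not exactly zero
```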



## Weight Initialization

After constructing a Neural Network architecture and preprocessing the data, we have to initialize its parameters before we can begin to train the network.

### Pitfall: all zero initialization ❌

NEVER set all the initial weights to zero!!!

- if every neuron in the network computes the same output, then they will also all compute the same gradients during backpropagation and undergo the exact *same* parameter updates. 
- In other words, there is *no source of asymmetry* between neurons if their weights are initialized to be the same.
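
The symmetry problem can be observed directly. The sketch below uses a hypothetical two-layer ReLU network with a mean squared error loss, where every weight is set to the same constant; the architecture, loss, and sizes are assumptions chosen to keep the example short. All hidden neurons then receive exactly the same gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, H = 8, 5, 4
X = rng.standard_normal((N, D))
y = rng.standard_normal((N, 1))

# Every weight gets the same value, so all H hidden neurons start out identical
W1 = np.full((D, H), 0.5)
W2 = np.full((H, 1), 0.5)

# Forward pass: ReLU hidden layer, linear output, mean squared error
Z1 = X @ W1
H1 = np.maximum(0, Z1)
out = H1 @ W2
dout = 2.0 * (out - y) / N

# Backward pass
dW2 = H1.T @ dout
dH1 = dout @ W2.T
dZ1 = dH1 * (Z1 > 0)
dW1 = X.T @ dZ1

print(dW1)   # every column (one per hidden neuron) is the same
```

Because the columns of `dW1` are identical, a gradient step keeps the neurons identical, and the hidden layer effectively behaves like a single neuron.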

### Small random numbers

Initialize the weights of the neurons to **small random numbers**. This is referred to as *symmetry breaking*.

- Idea: 

  - neurons are all random and unique in the beginning
  - they will compute distinct updates and integrate themselves as diverse parts of the full network

- E.g.: 

  ~~~python
  W = 0.01 * np.random.randn(D, H)
  ~~~

  every neuron’s weight vector is initialized as a random vector sampled from a multi-dimensional gaussian

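**Warning**: It’s not necessarily the case that smaller numbers will work strictly better. A layer with very small weights computes very small gradients on its inputs during backpropagation (the gradient is proportional to the weights), which can greatly diminish the gradient signal flowing backward through a deep network.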

### Calibrating the variances with 1/sqrt(n)

Problem of small random numbers initialization:

- the distribution of the outputs from a randomly initialized neuron has a variance that grows with the number of inputs

Solution: normalize the variance of each neuron’s output to 1 by dividing its weight vector by the square root of its *fan-in* (i.e. its number of inputs).

- Initialize each neuron's weight vector as

  ~~~python
  w = np.random.randn(n) / np.sqrt(n)
  ~~~

  - `n`: number of inputs

- ensures that all neurons in the network initially have approximately the same output distribution

  $\rightarrow$ empirically improves the rate of convergence.
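
The effect of the calibration can be checked empirically. For zero-mean, independent weights and inputs, the variance of $s = \sum_i w_i x_i$ is $n \,\text{Var}(w)\,\text{Var}(x)$, so unit-variance weights give a variance of about $n$, and dividing by $\sqrt{n}$ brings it back to about 1. The sketch below estimates both variances by sampling; the values of `n` and `trials` are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500                  # fan-in
trials = 2000            # number of independent neurons/inputs to sample

x = rng.standard_normal((trials, n))         # unit-variance inputs
w_raw = rng.standard_normal((trials, n))     # unscaled weights
w_cal = w_raw / np.sqrt(n)                   # calibrated weights

s_raw = np.sum(w_raw * x, axis=1)            # one pre-activation per trial
s_cal = np.sum(w_cal * x, axis=1)

var_raw = s_raw.var()    # roughly n
var_cal = s_cal.var()    # roughly 1
print(var_raw, var_cal)
```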

### Sparse initialization

- Set all weight matrices to zero, 
- but to break symmetry every neuron is randomly connected (with weights sampled from a small gaussian as above) to a fixed number of neurons below it.
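
One possible way to implement this is sketched below. The function name `sparse_init` and its parameters `k` (connections per neuron) and `scale` are hypothetical choices for this illustration, and columns are taken to be the weight vectors of individual neurons, matching the `[D x H]` layout used earlier.

```python
import numpy as np

def sparse_init(n_in, n_out, k, scale=0.01, rng=None):
    """Return an [n_in x n_out] matrix with exactly k nonzero weights per column."""
    rng = np.random.default_rng() if rng is None else rng
    W = np.zeros((n_in, n_out))
    for j in range(n_out):
        # pick k distinct inputs for output neuron j
        idx = rng.choice(n_in, size=k, replace=False)
        W[idx, j] = scale * rng.standard_normal(k)
    return W

W = sparse_init(n_in=100, n_out=20, k=10, rng=np.random.default_rng(0))
print(np.count_nonzero(W, axis=0))   # 10 connections for every neuron
```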

### Initializing the biases

- Common: simply 0 bias initialization
- For ReLU
  - can use a small constant value (such as 0.01) for all biases
    - ensures that all ReLU units fire in the beginning

### Current recommendation in practice 💪

- Use ReLU units

- Initialize with 

  ~~~python
  w = np.random.randn(n) * np.sqrt(2.0/n)
  ~~~

  (as discussed in [He et al.](http://arxiv-web3.library.cornell.edu/abs/1502.01852).)
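
To see why the factor of 2 matters for ReLU units, the following rough experiment pushes random inputs through a stack of randomly initialized ReLU layers and compares the mean squared activation at the end. The helper `mean_square_after`, the layer width, and the depth are assumptions made for this sketch. Since a ReLU zeroes roughly half of a symmetric pre-activation distribution, the $1/\sqrt{n}$ scaling loses about half of the signal per layer, while $\sqrt{2/n}$ roughly preserves it.

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth, batch = 512, 10, 256

def mean_square_after(scale_fn):
    """Push unit-variance inputs through `depth` ReLU layers; return mean(h**2)."""
    h = rng.standard_normal((batch, n))
    for _ in range(depth):
        W = rng.standard_normal((n, n)) * scale_fn(n)
        h = np.maximum(0, h @ W)
    return np.mean(h ** 2)

ms_he = mean_square_after(lambda fan_in: np.sqrt(2.0 / fan_in))     # stays on the order of 1
ms_plain = mean_square_after(lambda fan_in: np.sqrt(1.0 / fan_in))  # shrinks by about 1/2 per layer
print(ms_he, ms_plain)
```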

### Batch normalization

- Explicitly force the activations throughout a network to take on a unit gaussian distribution at the beginning of the training
- In practice usually insert the BatchNorm layer immediately after fully connected layers (or convolutional layers)
- Networks that use Batch Normalization are significantly more robust to bad initialization
- Batch normalization can be interpreted as doing preprocessing at every layer of the network, but integrated into the network itself in a differentiable manner. 
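
A minimal sketch of the training-time forward pass of a BatchNorm layer is shown below. The function name and the parameter names `gamma`, `beta`, and `eps` are assumptions for this illustration; the running averages used at test time and the backward pass are deliberately left out.

```python
import numpy as np

def batchnorm_forward_train(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift."""
    mu = x.mean(axis=0)                     # per-feature batch mean
    var = x.var(axis=0)                     # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # roughly zero mean, unit variance
    return gamma * x_hat + beta             # learnable scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=10.0, size=(64, 5))   # badly scaled activations
out = batchnorm_forward_train(x, gamma=np.ones(5), beta=np.zeros(5))
print(out.mean(axis=0), out.std(axis=0))
```

With `gamma = 1` and `beta = 0` the output of each feature is standardized regardless of how the incoming activations were scaled, which is one way to read the robustness to initialization mentioned above.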

## Regularization

Controlling the capacity of Neural Networks to **prevent overfitting** (can be seen as penalizing some measure of complexity of the model).

### L2 regularization

- The most common form of regularization
- Penalize the squared magnitude of all parameters directly in the objective
  - For every weight $w$ in the network, add the term $\frac{1}{2}\lambda w^2$ to the objective
    - $\lambda$: regularization strength
- Heavily penalizing peaky weight vectors and preferring diffuse weight vectors
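
A small worked example of the "peaky vs. diffuse" preference: the two weight vectors below (made-up values) produce the same score on the input `x`, but the diffuse one incurs a smaller L2 penalty. The helper names `l2_penalty` and `l2_grad` and the value of `lam` are illustrative assumptions.

```python
import numpy as np

x = np.array([1.0, 1.0, 1.0, 1.0])
w_peaky = np.array([1.0, 0.0, 0.0, 0.0])
w_diffuse = np.array([0.25, 0.25, 0.25, 0.25])
lam = 0.1   # regularization strength (arbitrary value for illustration)

def l2_penalty(w, lam):
    return 0.5 * lam * np.sum(w ** 2)

def l2_grad(w, lam):
    # d/dw of 0.5 * lam * w^2 is lam * w: every weight decays linearly toward zero
    return lam * w

score_peaky, score_diffuse = w_peaky @ x, w_diffuse @ x   # both equal 1.0
pen_peaky = l2_penalty(w_peaky, lam)        # 0.5 * 0.1 * 1.0  = 0.05
pen_diffuse = l2_penalty(w_diffuse, lam)    # 0.5 * 0.1 * 0.25 = 0.0125
print(pen_peaky, pen_diffuse)
```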

### L1 regularization

- Another relatively common form of regularization
- For each weight $w$ we add $\lambda |w|$ to the objective
- Leads the weight vectors to become sparse during optimization (i.e. very close to exactly zero)

  - I.e., neurons with L1 regularization end up using only a sparse subset of their most important inputs and become nearly invariant to the “noisy” inputs
- We can combine L1 regularization with L2 regularization: $\lambda_1 |w| + \lambda_2 w^2$ (a.k.a  [Elastic net regularization](http://web.stanford.edu/~hastie/Papers/B67.2 (2005) 301-320 Zou & Hastie.pdf))
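
The combined penalty and a (sub)gradient for it might be written as follows. The function names and the example values are assumptions for illustration; at $w = 0$ the absolute value is not differentiable, and the sketch simply uses 0 there.

```python
import numpy as np

def elastic_net_penalty(w, lam1, lam2):
    return lam1 * np.sum(np.abs(w)) + lam2 * np.sum(w ** 2)

def elastic_net_subgrad(w, lam1, lam2):
    # np.sign(0) is 0, which is a valid subgradient choice for |w| at w = 0
    return lam1 * np.sign(w) + 2.0 * lam2 * w

w = np.array([0.5, -2.0, 0.0])
pen = elastic_net_penalty(w, lam1=0.1, lam2=0.01)   # 0.1 * 2.5 + 0.01 * 4.25 = 0.2925
g = elastic_net_subgrad(w, lam1=0.1, lam2=0.01)     # [0.11, -0.14, 0.0]
print(pen, g)
```

Note that the L1 part of the gradient has constant magnitude `lam1` regardless of how small a weight is, which is what drives weights all the way toward zero.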

### Max norm constraints

- Enforce an absolute upper bound on the magnitude of the weight vector for every neuron
- Use projected gradient descent to enforce the constraint
- In practice, this corresponds to 
  - performing the parameter update as normal
  - then enforcing the constraint by clamping the weight vector $w$ of every neuron to satisfy $\|w\|_2 < c$

- The network cannot “explode” even when the learning rates are set too high because the updates are always bounded.
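
The clamping step could look like the sketch below, applied after each ordinary parameter update. The function name `clamp_max_norm` is hypothetical, columns are assumed to hold the weight vectors of individual neurons, and the projection enforces $\|w\|_2 \le c$ (vectors that are too long end up exactly on the boundary).

```python
import numpy as np

def clamp_max_norm(W, c):
    """Rescale each column of W (one neuron's weight vector) to L2 norm at most c."""
    norms = np.linalg.norm(W, axis=0, keepdims=True)
    # columns already inside the ball keep a factor of 1
    factor = np.minimum(1.0, c / np.maximum(norms, 1e-12))
    return W * factor

W = np.array([[3.0, 0.3],
              [4.0, 0.4]])          # column norms: 5.0 and 0.5
W_clamped = clamp_max_norm(W, c=3.0)
print(W_clamped)                    # first column rescaled to norm 3, second unchanged
```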

### Dropout 👍

- Extremely effective, simple regularization technique

- Complements the other methods (L1, L2, maxnorm)

- During training, keep a neuron active only with some probability $p$ (a hyperparameter), and set it to zero otherwise

- Can be interpreted as sampling a Neural Network within the full Neural Network, and only updating the parameters of the sampled network based on the input data. 

  - Note: the exponentially many possible sampled networks are *NOT independent* because they share the parameters

- During testing there is NO dropout applied, with the interpretation of evaluating an averaged prediction across the exponentially-sized ensemble of all sub-networks (similar to ensemble learning)

  <img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/dropout.jpeg" alt="img" style="zoom:80%;" />

Example: Vanilla dropout in a 3-layer Neural Network:

~~~python
""" Vanilla Dropout: Not recommended implementation """

p = 0.5 # probability of keeping a unit active (higher p = less dropout)

def train_step(X):
  """ X contains the data"""
  
  # forward pass 
  H1 = np.maximum(0, np.dot(W1, X) + b1)
  U1 = np.random.rand(*H1.shape) < p # first dropout mask
  H1 *= U1 # drop
  H2 = np.maximum(0, np.dot(W2, H1) + b2)
  U2 = np.random.rand(*H2.shape) < p # second dropout mask
  H2 *= U2 # drop
  out = np.dot(W3, H2) + b3
  
  # backward pass: compute gradients... (not shown)
  # perform parameter update... (not shown)
  
def predict(X):
  # ensembled forward pass
  H1 = np.maximum(0, np.dot(W1, X) + b1) * p # scale the activations
  H2 = np.maximum(0, np.dot(W2, H1) + b2) * p # scale the activations
  out = np.dot(W3, H2) + b3
~~~

Note: In the `predict` function we are not dropping anymore, but we are performing a **scaling of both hidden layer outputs** by $p$. 

- This is important because at test time all neurons see all their inputs, so we want the outputs of neurons at test time to be *identical* to their expected outputs at training time.

- E.g., consider an output of a neuron $x$ (before dropout)

  - With dropout, the expected output is
    $$
    px + (1-p)0
    $$

  - At test time, when we keep the neuron always active, we must adjust $x \to px$ to keep the same expected output

- Performing this attenuation at test time can be related to the process of iterating over all the possible binary masks (and therefore all the exponentially many sub-networks) and computing their ensemble prediction.

The undesirable property of the scheme presented above is that **we must scale the activations by $p$ at test time.** Since test-time performance is so critical, it is always preferable to use **inverted dropout**, which 

- performs the scaling at train time, leaving the forward pass at test time untouched.
- 👍 Additional pros: prediction code can remain untouched when you decide to tweak where you apply dropout, or if at all.

~~~python
""" 
Inverted Dropout: Recommended implementation example.
We drop and scale at train time and don't do anything at test time.
"""

p = 0.5 # probability of keeping a unit active. higher = less dropout

def train_step(X):
  # forward pass for example 3-layer neural network
  H1 = np.maximum(0, np.dot(W1, X) + b1)
  U1 = (np.random.rand(*H1.shape) < p) / p # first dropout mask. Notice /p!
  H1 *= U1 # drop!
  H2 = np.maximum(0, np.dot(W2, H1) + b2)
  U2 = (np.random.rand(*H2.shape) < p) / p # second dropout mask. Notice /p!
  H2 *= U2 # drop!
  out = np.dot(W3, H2) + b3
  
  # backward pass: compute gradients... (not shown)
  # perform parameter update... (not shown)
  
def predict(X):
  # ensembled forward pass
  H1 = np.maximum(0, np.dot(W1, X) + b1) # no scaling necessary
  H2 = np.maximum(0, np.dot(W2, H1) + b2) # no scaling necessary
  out = np.dot(W3, H2) + b3
~~~
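
The expectation argument behind inverted dropout can be checked numerically for a single layer. The sketch below uses made-up activations `h` and averages the dropped-and-rescaled activations over many random masks; the average approaches the undropped activations, which is why no scaling is needed at test time. The names `n_masks` and `h_train_mean` are assumptions for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                                  # probability of keeping a unit
h = np.array([0.2, 1.0, 3.0, 0.0])       # some post-ReLU activations (made up)

n_masks = 20000
masks = (rng.random((n_masks, h.size)) < p) / p   # inverted dropout masks
h_train_mean = (masks * h).mean(axis=0)           # Monte Carlo estimate of E[mask * h]

print(h_train_mean)   # close to h itself
```

This only verifies the per-layer expectation; once the dropped activations pass through further non-linear layers, the match between training-time averages and the test-time forward pass is approximate.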

### **Theme of noise in forward pass**

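- Dropout belongs to a more general family of methods that introduce stochastic behavior in the forward pass of a network
- E.g. [DropConnect](http://cs.nyu.edu/~wanli/dropc/): a random set of weights (rather than activations) is set to zero during the forward pass
- Convolutional Neural Networks also take advantage of this theme with methods such as stochastic pooling, fractional pooling, and data augmentation.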

### **Bias regularization**

In practical applications (and with proper data preprocessing) regularizing the bias rarely leads to significantly worse performance. This is likely because there are very few bias terms compared to all the weights, so the classifier can “afford to” use the biases if it needs them to obtain a better data loss.

### Summary

In practice:

- Most common to use a single, global L2 regularization strength that is cross-validated.
- Also common to combine this with dropout applied after all layers. 
  - The value of $p=0.5$ is a reasonable default, but this can be tuned on validation data.



## Loss functions

In a supervised learning problem, ***data loss* measures the compatibility between a prediction (e.g. the class scores in classification) and the ground truth label.**

- Takes the form of an average over the data losses for every individual example
  $$
  L = \frac{1}{N}\sum_i L_i
  $$

  - $N$: number of training data

Let's abbreviate $f = f(x_i; W)$ to be the activations of the output layer in a Neural Network.

### Classification

We assume a dataset of examples and a single correct label (out of a fixed set) for each example. 

Two common functions:

- SVM (hinge loss)

  $$
  L_{i}=\sum_{j \neq y_{i}} \max \left(0, f_{j}-f_{y_{i}}+1\right)
  $$

  - some people report better performance with the squared hinge loss (using $\max \left(0, f_{j}-f_{y_{i}}+1\right)^{2}$)

-  Softmax classifier that uses the cross-entropy loss

   $$
   L_{i}=-\log \left(\frac{e^{f_{y_{i}}}}{\sum_{j} e^{f_{j}}}\right)
   $$

   - Problem: large number of classes
     - When the set of labels is very large (e.g. words in English dictionary, or ImageNet which contains 22,000 categories), computing the full softmax probabilities becomes expensive. 
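
Both classification losses can be sketched for a single example as follows. The function names `svm_loss` and `softmax_loss`, the margin parameter `delta`, and the example scores are assumptions for illustration. The softmax version subtracts the maximum score before exponentiating, a common trick that leaves the result unchanged but avoids overflow.

```python
import numpy as np

def svm_loss(f, y, delta=1.0):
    """Multiclass hinge loss for one example with scores f and correct class y."""
    margins = np.maximum(0.0, f - f[y] + delta)
    margins[y] = 0.0                 # the correct class does not contribute
    return margins.sum()

def softmax_loss(f, y):
    """Cross-entropy loss for one example; shifts scores for numerical stability."""
    f = f - np.max(f)                # does not change the softmax output
    log_probs = f - np.log(np.sum(np.exp(f)))
    return -log_probs[y]

f = np.array([3.2, 5.1, -1.7])       # hypothetical class scores
y = 0                                # index of the correct class
print(svm_loss(f, y))                # max(0, 5.1 - 3.2 + 1) + max(0, -1.7 - 3.2 + 1) = 2.9
print(softmax_loss(f, y))
```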

### Regression

The task of predicting real-valued quantities.

For this task, it is common to compute the difference between the predicted quantity and the true answer and then measure

- the L2 squared norm
  $$
  L_{i}=\left\|f-y_{i}\right\|_{2}^{2}
  $$

- or L1 norm of the difference (formulated by summing the absolute value along each dimension)
  $$
  L_{i}=\left\|f-y_{i}\right\|_{1}=\sum_{j}\left|f_{j}-\left(y_{i}\right)_{j}\right|
  $$
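
A short numeric example of both regression losses, using made-up prediction and target vectors. The gradients with respect to the prediction are included as well (the names `grad_L2` and `grad_L1` are introduced here) because they show how a single large error dominates the L2 gradient while the L1 gradient only depends on the sign of each error.

```python
import numpy as np

f = np.array([2.5, 0.0, -1.0])      # predicted vector (made-up values)
y_i = np.array([3.0, -0.5, 1.0])    # ground-truth vector

diff = f - y_i                      # [-0.5, 0.5, -2.0]
L2 = np.sum(diff ** 2)              # 0.25 + 0.25 + 4.0 = 4.5
L1 = np.sum(np.abs(diff))           # 0.5 + 0.5 + 2.0 = 3.0

grad_L2 = 2.0 * diff                # [-1.0, 1.0, -4.0]: grows with the error
grad_L1 = np.sign(diff)             # [-1.0, 1.0, -1.0]: bounded magnitude
print(L2, L1)
```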

### Word of caution

**The L2 loss is much harder to optimize than a more stable loss, such as Softmax.**

- The L2 loss requires a very fragile and specific property from the network to output exactly one correct value for each input (and its augmentations)
  - This is not the case with Softmax, where the precise value of each score is less important: *It only matters that their magnitudes are appropriate*.

- The L2 loss is less robust because outliers can introduce huge gradients.
- Applying dropout in the network (especially in the layer right before the L2 loss) is not a great idea