# Deep Learning

Example: Handwritten Digit Recognition

In deep learning, the function *f* is represented as a neural network.

<hr>

## Elements of a neural network
- Neuron: a function that takes inputs and produces an output
- Layer: a set of neurons
- Network: a set of layers
- Bias: a parameter that is added to the weighted sum of a neuron's inputs
- Weight: a parameter that multiplies an input of a neuron
- Activation function: a function that is applied to the weighted sum plus bias to produce the neuron's output
- Input layer: the first layer of a network
- Output layer: the last layer of a network
- Hidden layer (deep layer): a layer between the input and output layers
- Cost function: a function that measures the difference between the output of a network and the desired output

Difference between weight and bias
- The bias is added to the weighted sum of a neuron's inputs
- A weight is multiplied by an input of the neuron

output = activation_function(weight * input + bias)
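The formula above can be sketched for a neuron with several inputs. This is a minimal illustration, assuming a sigmoid activation; the function names and the input values are made up for the example.

```python
import math

def sigmoid(z):
    """Logistic sigmoid, used here as the activation function."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(inputs, weights, bias, activation=sigmoid):
    """Weighted sum of the inputs plus the bias, passed through the activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

# Illustrative values, not taken from any dataset.
y = neuron_output(inputs=[1.0, 2.0], weights=[0.5, -0.25], bias=0.0)
```

Here the weighted sum is 0, so the sigmoid returns 0.5.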

<hr>

## Activation functions
- Sigmoid function: $f(x) = \frac{1}{1+e^{-x}}$
- Tanh function: $f(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$
- ReLU function: $f(x) = \max(0, x)$
- Softmax function: $f(x)_i = \frac{e^{x_i}}{\sum_{j=1}^n e^{x_j}}$
- Linear function: $f(x) = x$
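These functions can be written in a few lines of plain Python. The sketch below is one possible implementation; subtracting the maximum inside `softmax` is a common numerical-stability trick and is an addition here, not part of the formula itself.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    return math.tanh(x)

def relu(x):
    return max(0.0, x)

def softmax(xs):
    """Softmax over a list; subtracting the max keeps exp() from overflowing."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])  # three probabilities that sum to 1
```

Unlike the other functions, softmax acts on a whole vector: each output depends on every input.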

<hr>

## Layers
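- Dense layer: a layer in which every neuron is connected to every neuron of the previous layer
- Convolutional layer: a layer in which each neuron is connected only to a local region of the previous layer, using weights that are shared across positions
- Pooling layer: a layer that summarizes each local region of the previous layer (for example by its maximum) to reduce its size
- Softmax layer: a layer that applies the softmax function to the previous layer's outputs, turning them into a probability distribution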

<hr>

## Propagation
-  Forward propagation: the process of calculating the output of a neural network
-  Backward propagation: the process of calculating the gradient of the loss function with respect to the parameters of a neural network
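The two passes can be illustrated on the smallest possible network: one sigmoid neuron with one input. The sketch assumes a squared-error cost $C = \frac{1}{2}(y - t)^2$, which is a choice made for this example, and checks the backward pass against a finite-difference estimate.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w, b, x):
    """Forward propagation: compute the neuron's output from its input."""
    return sigmoid(w * x + b)

def cost(w, b, x, target):
    return 0.5 * (forward(w, b, x) - target) ** 2

def backward(w, b, x, target):
    """Backward propagation: gradient of the cost with respect to w and b."""
    y = forward(w, b, x)
    dC_dy = y - target        # derivative of the cost w.r.t. the output
    dy_dz = y * (1.0 - y)     # derivative of the sigmoid
    dC_dz = dC_dy * dy_dz     # chain rule
    return dC_dz * x, dC_dz   # dC/dw, dC/db

# Illustrative values; compare with a central finite difference.
w, b, x, target = 0.3, -0.1, 2.0, 1.0
grad_w, grad_b = backward(w, b, x, target)
eps = 1e-6
numeric_w = (cost(w + eps, b, x, target) - cost(w - eps, b, x, target)) / (2 * eps)
```

In a multi-layer network the same chain rule is applied layer by layer, from the output back to the input.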

<hr>

## Cost functions
The cost can be, for example, the Euclidean distance or the cross entropy between the network output and the desired output.
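As a rough sketch, both costs can be computed for a three-class prediction. The squared form of the Euclidean distance, the one-hot target, and the small `eps` that guards against `log(0)` are assumptions made for this example.

```python
import math

def squared_euclidean(output, target):
    """Squared Euclidean distance between the output and the target."""
    return sum((o - t) ** 2 for o, t in zip(output, target))

def cross_entropy(probs, target, eps=1e-12):
    """Cross entropy between predicted probabilities and a one-hot target."""
    return -sum(t * math.log(p + eps) for p, t in zip(probs, target))

target = [0.0, 1.0, 0.0]   # the correct class is the second one
good = [0.1, 0.8, 0.1]     # confident and correct prediction
bad = [0.7, 0.2, 0.1]      # confident and wrong prediction
```

Both costs are lower for `good` than for `bad`, which is the property gradient descent relies on.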

Use gradient descent to minimize the cost function. Backpropagation computes the gradients that gradient descent needs.

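Plateau: a region where the gradient is close to zero, so updates are tiny, even though the cost function is not minimized.

Saddle point: a point where the gradient is zero but that is a minimum along some directions and a maximum along others, so the cost function is not minimized.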

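Momentum can help gradient descent move through plateaus and past saddle points, because the accumulated velocity keeps the parameters moving when the current gradient is close to zero. Strictly speaking, momentum is a method to accelerate the convergence of gradient descent, not a guarantee of escaping such regions. It is not guaranteed to reach the global minimum, but it gives hope.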
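A minimal sketch of the momentum update, assuming the toy cost $C(w) = w^2$ and illustrative values for the learning rate `lr` and the momentum coefficient `beta`:

```python
def gradient_descent_momentum(grad, w0, lr=0.1, beta=0.9, steps=200):
    """Gradient descent with momentum on a one-dimensional function."""
    w, velocity = w0, 0.0
    for _ in range(steps):
        # The velocity is a decaying sum of past gradients.
        velocity = beta * velocity - lr * grad(w)
        w = w + velocity
    return w

# Toy cost C(w) = w**2, whose gradient is 2w and whose minimum is at w = 0.
w_final = gradient_descent_momentum(lambda w: 2.0 * w, w0=5.0)
```

Because the velocity carries over between steps, the update stays non-zero for a while even where the current gradient is zero.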

<hr>

## Optimization
- Gradient descent: a method to minimize the cost function by repeatedly stepping in the direction of the negative gradient
- Momentum: a method that accelerates gradient descent by adding a decaying sum of past updates to the current step
- RMSprop: a method that scales each parameter's learning rate by a running average of its recent squared gradients
- Adam: a method that combines momentum with RMSprop-style adaptive learning rates
- Learning rate: a parameter that controls the step size of gradient descent
- Batch size: a parameter that controls the number of samples used to calculate the gradient
- Epoch: one full pass over the training set; the number of epochs controls how long training runs

<hr>

## Mini-batch gradient descent

-  Batch gradient descent: use all the samples to calculate the gradient
-  Stochastic gradient descent: use one sample to calculate the gradient
-  Mini-batch gradient descent: use a subset of the samples to calculate the gradient

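Mini-batch gradient descent is usually the practical choice: it updates the parameters more often per epoch than batch gradient descent, and it is less noisy and makes better use of parallel hardware than single-sample stochastic gradient descent.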
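The three variants differ only in how many samples feed each gradient computation. The sketch below shows one way to split an epoch into shuffled mini-batches; the helper name `minibatches` and the fixed seed are assumptions for the example.

```python
import random

def minibatches(samples, batch_size, seed=0):
    """Yield shuffled mini-batches that together cover every sample once (one epoch)."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [samples[i] for i in indices[start:start + batch_size]]

data = list(range(10))   # stand-in for 10 training samples
batches = list(minibatches(data, batch_size=4))   # batch sizes 4, 4, 2
```

Setting `batch_size=1` gives stochastic gradient descent, and `batch_size=len(data)` gives batch gradient descent.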

<hr>

## Why deeper is better

Any continuous function f,
$f: R^n \rightarrow R^m$,
can be approximated by a neural network with one hidden layer, given enough hidden neurons. A shallow network is therefore expressive enough in principle; the argument for depth is that a deep network can often represent the same function with fewer parameters.

- Fat layer: a layer with many neurons
- Thin layer: a layer with few neurons
- Short network: a network with few layers
- Tall network: a network with many layers

- Shallower network: a network with few layers
- Deeper network: a network with many layers

Input layer + Learnable kernel + Simple classifier = Deep learning

<hr>

## ReLU
ReLU is often preferred over sigmoid because it is cheaper to compute and does not saturate for positive inputs.

$f(x) = \max(0, x)$

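Vanishing gradient problem: the derivative of the sigmoid function is at most 0.25, so the gradient shrinks each time it is propagated back through a layer, which makes the early layers of a deep network difficult to train.

With ReLU, the derivative is 1 for every positive input, so the vanishing gradient problem is greatly reduced.

Pros
- Fast to compute
- Mitigates the vanishing gradient problem
- Thinner network: neurons that output 0 drop out of the computation for that input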
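A back-of-the-envelope comparison makes the effect concrete. The sketch multiplies the activation derivative across a hypothetical 10-layer chain; it deliberately ignores the weights, so it shows only the contribution of the activation function.

```python
import math

def sigmoid_grad(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

depth = 10  # hypothetical number of layers
# Best case for sigmoid: every pre-activation is 0, where the derivative peaks at 0.25.
sigmoid_factor = sigmoid_grad(0.0) ** depth
# ReLU with positive pre-activations passes the gradient through unchanged.
relu_factor = relu_grad(1.0) ** depth
```

Even in the sigmoid's best case the factor is 0.25 to the power 10, which is below one millionth, while the ReLU factor stays at 1.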

## Maxout
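$f(x) = \max(w_1^Tx + b_1, w_2^Tx + b_2)$

ReLU is a special case of Maxout, obtained by fixing $w_2 = 0$ and $b_2 = 0$.

- Learnable activation function: an activation function whose shape is determined by parameters that the network learns during training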
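The relationship between Maxout and ReLU can be checked directly. This is a small sketch with made-up parameters; `pieces` is a hypothetical name for the list of `(w, b)` pairs.

```python
def maxout(x, pieces):
    """Maxout unit: the maximum over several affine functions w.x + b."""
    return max(sum(wi * xi for wi, xi in zip(w, x)) + b for w, b in pieces)

def relu(z):
    return max(0.0, z)

# With the second piece fixed at w_2 = 0 and b_2 = 0, Maxout equals ReLU(w_1.x + b_1).
w1, b1 = [1.0, -2.0], 0.5   # illustrative parameters
pieces = [(w1, b1), ([0.0, 0.0], 0.0)]
```

When both pieces are trainable, the unit can learn other piecewise-linear shapes as well.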

<hr>

## Learning rate

- Learning rate: a parameter that controls the step size of the gradient descent

If the learning rate is too large, the cost may fail to decrease, or may even increase, after an update.

If the learning rate is too small, training is slow.
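Both failure modes show up on a toy problem. The sketch below runs plain gradient descent on the assumed cost $C(w) = w^2$ with three illustrative learning rates.

```python
def run_gradient_descent(lr, w0=1.0, steps=20):
    """Minimize C(w) = w**2 and return the cost after each update."""
    w, costs = w0, []
    for _ in range(steps):
        w -= lr * 2.0 * w   # dC/dw = 2w
        costs.append(w * w)
    return costs

too_large = run_gradient_descent(lr=1.1)    # cost grows at every step
too_small = run_gradient_descent(lr=0.001)  # cost barely moves
moderate = run_gradient_descent(lr=0.1)     # cost shrinks quickly
```

With `lr=1.1` each update overshoots the minimum and lands further away than it started.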

## Adagrad
Adagrad is an adaptive learning rate method.

- Adaptive learning rate: a method that changes the learning rate during the training

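- Derivative: the rate of change of the cost with respect to one parameter (one component of the gradient)

Adagrad divides the learning rate of each parameter by the square root of the sum of that parameter's past squared derivatives. A smaller derivative therefore means a larger effective learning rate, and a larger derivative means a smaller one.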
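A minimal sketch of one Adagrad update, assuming the common formulation that accumulates squared gradients per parameter; `lr`, `eps`, and the gradient values are illustrative.

```python
import math

def adagrad_step(w, grad, accum, lr=0.1, eps=1e-8):
    """One Adagrad update for a list of parameters.

    accum holds the running sum of squared gradients for each parameter.
    """
    new_accum = [a + g * g for a, g in zip(accum, grad)]
    new_w = [wi - lr * g / (math.sqrt(a) + eps)
             for wi, g, a in zip(w, grad, new_accum)]
    return new_w, new_accum

# Two parameters: one with a small gradient, one with a large gradient.
w, accum = [0.0, 0.0], [0.0, 0.0]
w, accum = adagrad_step(w, grad=[0.01, 10.0], accum=accum)
# Effective learning rate per parameter after this step.
effective_lr = [0.1 / math.sqrt(a) for a in accum]
```

After the first step both parameters have moved by about the same amount, even though their gradients differ by a factor of 1000, because the effective learning rate is much larger for the parameter with the small gradient.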

## Other

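- RMSprop: an adaptive learning rate method similar to Adagrad that uses a decaying average of squared gradients instead of their full sum
- Adam: a method that combines RMSprop-style adaptive learning rates with momentum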

<hr>

## Dropout

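- Dropout: a method to prevent overfitting. At each training update, every neuron in the affected layers is dropped (its output is set to zero) with probability p.

Dropout is applied only during training, not during testing. At test time all neurons are used.

Dropout is typically applied to hidden layers and is not applied to the output layer.

It is a kind of ensemble method, because each update trains a different thinned sub-network and the full network at test time approximates their average.

- Ensemble method: a method that combines multiple models to make a prediction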
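The training and testing behaviour can be sketched as follows. This example uses the "inverted dropout" variant, which rescales the surviving activations during training so that nothing needs to change at test time; that variant is an assumption of the sketch, not something the notes above specify.

```python
import random

def dropout(activations, p, training, rng=random):
    """Inverted dropout: zero each activation with probability p during training.

    Survivors are scaled by 1 / (1 - p) so the expected value is unchanged,
    which lets test time use the activations as they are.
    """
    if not training or p == 0.0:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)    # fixed seed for reproducibility
hidden = [1.0] * 1000     # stand-in hidden-layer outputs
train_out = dropout(hidden, p=0.5, training=True, rng=rng)
test_out = dropout(hidden, p=0.5, training=False)
```

Roughly half of `train_out` is zero and the rest is doubled, while `test_out` is identical to the input.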

<hr>

## Convolutional Neural Network

- Convolutional layer: a layer in which each neuron is connected only to a local region of the previous layer
- Pooling layer: a layer that summarizes each local region of the previous layer to reduce its size
- Convolutional neural network: a neural network that contains convolutional layers and pooling layers

Multiple convolutional layers and pooling layers are stacked together to form a convolutional neural network.


Difference between convolutional layer and fully connected layer
- In a convolutional layer, each neuron is connected to a local subset of the previous layer
- In a fully connected layer, each neuron is connected to every neuron of the previous layer

Each convolutional layer has one or more filters.

- Filter: a small matrix of learnable weights that slides over the input to calculate the output of a convolutional layer

- Feature map: the output that one filter of a convolutional layer produces

- Stride: the step size with which the filter moves across the input

A fully connected layer has no filters. Each of its outputs is calculated from all inputs using its own weights and a bias.

In a convolutional layer, the same filter weights are shared across all positions of the input.

Input image -> Convolutional layer -> Max pooling layer -> Convolutional layer -> Max pooling layer -> Flatten -> Fully connected Feedforward Network -> Output

- Max pooling: a method to reduce the size of the feature map by keeping only the maximum of each local region
- Flatten: a method to flatten the feature map to a vector
- Feedforward: the process of calculating the output of a neural network

Max pooling is used to reduce the complexity of the network.
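The filter, stride, feature map, and max pooling steps can be sketched on a single-channel toy image. The code assumes no padding and non-overlapping pooling windows; the image and filter values are made up.

```python
def conv2d(image, kernel, stride=1):
    """2D cross-correlation without padding, as used in CNN convolutional layers."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = (len(image) - kh) // stride + 1
    out_w = (len(image[0]) - kw) // stride + 1
    return [[sum(image[r * stride + i][c * stride + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]

def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling; rows or columns that do not fit are dropped."""
    out_h, out_w = len(feature_map) // size, len(feature_map[0]) // size
    return [[max(feature_map[r * size + i][c * size + j]
                 for i in range(size) for j in range(size))
             for c in range(out_w)]
            for r in range(out_h)]

# A 5x5 toy image and a 2x2 filter (illustrative values).
image = [[1, 0, 2, 1, 0],
         [0, 1, 3, 0, 1],
         [2, 1, 0, 1, 2],
         [1, 0, 1, 3, 0],
         [0, 2, 1, 0, 1]]
kernel = [[1, 0],
          [0, 1]]
feature_map = conv2d(image, kernel)   # 4x4 feature map
pooled = max_pool2d(feature_map)      # 2x2 after 2x2 max pooling
```

The same four filter weights are reused at all 16 positions, which is the weight sharing described above.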

CNN in Keras

```python
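from keras.models import Sequential
from keras.layers import Dense, Conv2D, MaxPooling2D, Flatten

model = Sequential()
# 25 filters of size 3x3 applied to a 28x28 grayscale image
model.add(Conv2D(25, (3, 3), activation='relu', input_shape=(28, 28, 1)))
# Keep the maximum of each 2x2 region, halving the width and height
model.add(MaxPooling2D(pool_size=(2, 2)))
# 50 filters of size 3x3 applied to the 25 pooled feature maps
model.add(Conv2D(50, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
# Flatten the feature maps into a vector for the fully connected layers
model.add(Flatten())
model.add(Dense(100, activation='relu'))
# One output per digit class; softmax turns the scores into probabilities
model.add(Dense(10, activation='softmax'))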
```
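The layer sizes and parameter counts of this model can be worked out by hand. The sketch below does the arithmetic in plain Python, assuming the Keras defaults of no padding, a convolution stride of 1, and a pooling stride equal to the pool size; the helper names are made up for the example.

```python
def conv_output_size(size, kernel, stride=1):
    """Output width or height of a convolution without padding."""
    return (size - kernel) // stride + 1

def conv_params(kernel, in_channels, filters):
    """Filter weights plus one bias per filter."""
    return (kernel * kernel * in_channels + 1) * filters

def dense_params(inputs, units):
    """One weight per input plus one bias, for every unit."""
    return (inputs + 1) * units

size = 28
size = conv_output_size(size, 3)   # first 3x3 convolution: 26
size //= 2                         # 2x2 max pooling: 13
size = conv_output_size(size, 3)   # second 3x3 convolution: 11
size //= 2                         # 2x2 max pooling: 5
flat = size * size * 50            # Flatten over 50 feature maps: 1250

total_params = (conv_params(3, 1, 25) + conv_params(3, 25, 50)
                + dense_params(flat, 100) + dense_params(100, 10))
```

The two convolutional layers together hold 11,550 parameters, far fewer than the 125,100 of the first dense layer, which reflects the weight sharing in convolutional layers.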

<hr>

<img src="assets/cnn1.png">
<img src="assets/cnn2.png">
<img src="assets/cnn3.png">

<hr>

## Early stopping

Plot two curves against the epoch number: training loss and validation loss.

Early stopping is done when the validation loss does not decrease for a certain number of epochs.

At some point, the validation loss typically starts to increase while the training loss keeps decreasing. This is the point where the network begins to overfit, and training should stop there. This is called early stopping.

<img src="assets/earlystop.jpg">
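The stopping rule can be sketched in a few lines. The function name, the `patience` parameter, and the validation losses are all made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (stop_epoch, best_epoch) given per-epoch validation losses.

    Training stops once the validation loss has not improved for `patience`
    consecutive epochs; epochs are counted from 0.
    """
    best_loss, best_epoch = float("inf"), -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

# Made-up validation losses that fall and then rise.
losses = [0.90, 0.60, 0.45, 0.40, 0.42, 0.47, 0.55]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=2)
```

Here training stops at epoch 5, and the weights from epoch 3 are the ones worth keeping. Keras offers a comparable `EarlyStopping` callback with `monitor` and `patience` arguments.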