# Tuning Back Propagation

Description: 

> Vanishing and exploding gradients
> 
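> - The delta computed during back propagation should be the right size for gradual descent
> - Too small -> the delta decays and the weights barely change (vanishing gradient)
> - Too big -> the updates become choppy and the loss does not descend (exploding gradient)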

> **Gradient Descent**
> 
> - The gradient descent algorithm multiplies the gradient by a scalar known as the learning rate to determine the next point
> 
> **Solution**
> 
> - Weight initialization
> - Activation functions
> - Batch normalization
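
To make the "too small" and "too big" cases concrete, here is a minimal sketch that is not part of the original notebook. It assumes a heavily simplified network in which every layer scales the backpropagated delta by the same factor; `backpropagated_delta` is a hypothetical helper introduced only for illustration.

```python
def backpropagated_delta(per_layer_factor, num_layers, initial_delta=1.0):
    """Scale a delta by the same factor once per layer (a simplification)."""
    delta = initial_delta
    for _ in range(num_layers):
        delta *= per_layer_factor
    return delta


# 0.25 is the largest slope the sigmoid function can have,
# so a factor of 0.25 is an optimistic case for a sigmoid layer.
vanishing = backpropagated_delta(0.25, num_layers=10)

# A factor above 1 (assumed here to come from large weights) grows instead.
exploding = backpropagated_delta(4.0, num_layers=10)

print(f"factor 0.25 over 10 layers: {vanishing:.2e}")
print(f"factor 4.0 over 10 layers:  {exploding:.2e}")
```

In this toy setting the delta either shrinks toward zero or grows very large after only ten layers, which is the behaviour that weight initialization, activation functions, and batch normalization are meant to counter.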

## Batch Normalization

- Normalize the inputs to each hidden layer
- Center and scale (like StandardScaler)
- Normalize the inputs to be on the same scale
- Helps attain higher accuracy in fewer epochs
- Adds computation and increases inference time
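
The center-and-scale step above can be sketched in plain Python. This is an illustration under stated assumptions, not the code used by `create_and_run_model`: it normalizes a single feature across one mini-batch in training mode, and `batch_normalize`, `gamma`, `beta`, and `epsilon` are names chosen here for the example.

```python
import math
from statistics import fmean, pvariance


def batch_normalize(values, gamma=1.0, beta=0.0, epsilon=1e-5):
    """Normalize one feature across a mini-batch, then scale and shift it."""
    mean = fmean(values)
    variance = pvariance(values, mu=mean)
    # epsilon guards against division by zero when the batch has no spread
    scale = math.sqrt(variance + epsilon)
    return [gamma * (v - mean) / scale + beta for v in values]


# Hypothetical activations of one hidden unit for a batch of five samples
batch = [120.0, 80.0, 100.0, 60.0, 140.0]
normalized = batch_normalize(batch)

print([round(v, 3) for v in normalized])
print("mean:", round(fmean(normalized), 6))
print("variance:", round(pvariance(normalized), 6))
```

After normalization the batch has a mean of roughly 0 and a variance of roughly 1. In a real layer `gamma` and `beta` are learned, and running averages of the mean and variance are typically used at inference time, which is part of the extra computation noted above.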

In [None]:
from src.utils import base_model_config
from src.utils import create_and_run_model
from src.utils import plot_graph
from src.utils import get_data

In [None]:
# Batch Normalization Experiment

# Accuracy per epoch, keyed by model name
accuracy_measures = {}

normalization_list = ['none', 'batch']
for normalization in normalization_list:

    # Start every run from the same base configuration and data
    model_config = base_model_config()
    X, Y = get_data()

    # Change only the normalization setting for this run
    model_config["NORMALIZATION"] = normalization
    model_name = "Normalization-" + normalization
    history = create_and_run_model(model_config, X, Y, model_name)

    accuracy_measures[model_name] = history.history["accuracy"]

plot_graph(accuracy_measures, 'Batch Normalization')

### After running the experiment, we obtain the following results:
- The model that uses batch normalization gets a boost in accuracy in the early epochs
- The optimal number of epochs is reduced (at most 4-6 epochs)

# Optimizers

> Description:
> 
> - Regular gradient descent can be slow
> - Takes a lot of time to get closer to the desired accuracy
> - More training time and resources
> - Limited training data may also impact gradient descent
> - Optimizers help speed up the training process
> - They adjust the delta applied to the weights so the model gets closer to the desired state sooner

**Available Optimizers**

1. SGD (Stochastic Gradient Descent) - often suited to shallow networks
2. RMSprop (Root Mean Square Propagation) - often suited to deep networks
3. Adam (Adaptive Moment Estimation) - often suited to deep networks
4. Adagrad (Adaptive Gradient Algorithm) - often suited to sparse data
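
To show how these optimizers change the delta, the sketch below applies textbook scalar forms of the four update rules to a toy loss `f(w) = w**2`. It is an illustration only: the hyperparameter values are commonly quoted defaults chosen here as assumptions, the helper names (`sgd_step`, `adagrad_step`, `rmsprop_step`, `adam_step`, `minimize`) are hypothetical, and none of this reflects how `create_and_run_model` configures its optimizers.

```python
import math


def gradient(w):
    """Gradient of the toy loss f(w) = w**2."""
    return 2.0 * w


def sgd_step(w, state, lr=0.1):
    # Plain gradient descent: the delta is the gradient times the learning rate
    return w - lr * gradient(w)


def adagrad_step(w, state, lr=0.1, eps=1e-8):
    # Divide by the root of the sum of all squared gradients seen so far
    g = gradient(w)
    state["sum_sq"] = state.get("sum_sq", 0.0) + g * g
    return w - lr * g / (math.sqrt(state["sum_sq"]) + eps)


def rmsprop_step(w, state, lr=0.1, rho=0.9, eps=1e-8):
    # Divide by the root of a moving average of squared gradients
    g = gradient(w)
    state["avg_sq"] = rho * state.get("avg_sq", 0.0) + (1 - rho) * g * g
    return w - lr * g / (math.sqrt(state["avg_sq"]) + eps)


def adam_step(w, state, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    # Combine a moving average of gradients (m) with one of squared gradients (v)
    g = gradient(w)
    state["t"] = state.get("t", 0) + 1
    state["m"] = beta1 * state.get("m", 0.0) + (1 - beta1) * g
    state["v"] = beta2 * state.get("v", 0.0) + (1 - beta2) * g * g
    m_hat = state["m"] / (1 - beta1 ** state["t"])  # bias correction
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return w - lr * m_hat / (math.sqrt(v_hat) + eps)


def minimize(step_fn, w=5.0, steps=50):
    """Apply one optimizer's update rule repeatedly, starting from w."""
    state = {}
    for _ in range(steps):
        w = step_fn(w, state)
    return w


optimizers = {
    "sgd": sgd_step,
    "adagrad": adagrad_step,
    "rmsprop": rmsprop_step,
    "adam": adam_step,
}
results = {name: minimize(fn) for name, fn in optimizers.items()}
for name, w in results.items():
    print(f"{name:8s} final w = {w:.4f}")
```

All four rules move `w` from 5.0 toward the minimum at 0, but they scale the delta differently. A one-dimensional quadratic says little about which optimizer works best on a real network, so treat the printed values as a way to read the update rules rather than as a benchmark.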

In [None]:
# Optimizer experiment

# Accuracy per epoch, keyed by model name
accuracy_measures = {}

optimizer_list = ['sgd', 'rmsprop', 'adam', 'adagrad']
for optimizer in optimizer_list:

    # Start every run from the same base configuration and data
    model_config = base_model_config()
    X, Y = get_data()

    # Change only the optimizer for this run
    model_config["OPTIMIZER"] = optimizer
    model_name = "Optimizer-" + optimizer
    history = create_and_run_model(model_config, X, Y, model_name)

    accuracy_measures[model_name] = history.history["accuracy"]

plot_graph(accuracy_measures, 'Optimizers')

### After running the experiment, we obtain the following results:
1. The Adam optimizer performs better than the rest in the early epochs (but is unstable)
2. The RMSprop optimizer is the second best (and more stable)

# Learning Rate

> Description:
> 1. Rate at which the weights are changed in response to the estimated error
> 2. Works in conjunction with the optimizer
> 3. Numeric value used to scale the computed delta


**Learning Rate Selection**
- Large learning rate -> faster learning in fewer epochs, but a risk of exploding gradients
- Small learning rate -> slower learning, and a risk of vanishing gradients
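
The trade-off can be seen on a toy problem. The sketch below is not from the original notebook: it runs plain gradient descent on `f(w) = w**2`, and the function name `gradient_descent` and the three learning rates are assumptions picked to show a slow, a reasonable, and a diverging run.

```python
def gradient_descent(learning_rate, start=1.0, steps=20):
    """Minimize f(w) = w**2 with plain gradient descent; return visited points."""
    w = start
    path = [w]
    for _ in range(steps):
        # The gradient of w**2 is 2*w; the learning rate scales the delta
        w = w - learning_rate * 2.0 * w
        path.append(w)
    return path


for lr in [0.001, 0.1, 1.1]:
    final = gradient_descent(lr)[-1]
    print(f"learning rate {lr}: |w| after 20 steps = {abs(final):.4f}")
```

With the smallest rate `w` has barely moved from its starting point after 20 steps, with the middle rate it is close to the minimum at 0, and with the largest rate every step overshoots and `w` moves further away. The overshooting here comes from the step size alone, so it is a loose analogy for the exploding case above rather than a model of a deep network.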

In [None]:
# Learning Rate Experiment

# Accuracy per epoch, keyed by model name
accuracy_measures = {}

learning_rate_list = [0.001, 0.005, 0.01, 0.1, 0.5]
for learning_rate in learning_rate_list:

    # Start every run from the same base configuration and data
    model_config = base_model_config()
    X, Y = get_data()

    # Change only the learning rate for this run
    model_config["LEARNING_RATE"] = learning_rate
    model_name = "Learning-Rate-" + str(learning_rate)
    history = create_and_run_model(model_config, X, Y, model_name)

    # Keep the accuracy curve for plotting
    accuracy_measures[model_name] = history.history["accuracy"]

plot_graph(accuracy_measures, "Compare Learning Rates")

### After running the experiment, we obtain the following results:

1. A learning rate of 0.01 provides an accuracy boost in the early epochs (but is not stable)
2. A learning rate of 0.005 is the best for this model (more stable)