# Models

## Deep Learning
At this point I'm sure you have an intuitive understanding of what machine learning and neural networks are. A natural next question is 'how can I improve my model?'. The most basic approach is to add extra layers. Deep learning is the idea of scaling a neural network model to include many layers. In theory, adding more layers improves the ability of the model to extract features.

Despite its advantages, deep learning suffers from various challenges, from the need for a large dataset to heavy demands on computational resources. This section explores several problems that are more specific to deep learning, along with possible solutions that can be applied to most models regardless of architecture.

### Vanishing Gradient
The vanishing gradient problem is one cost of expanding a network to more layers. It is caused by activation functions whose gradient can be small, and it is made worse by backpropagation, which multiplies these small gradients together to produce an even smaller result.

The vanishing gradient problem is common with the sigmoid and tanh functions because, for very low and very high activation values, the gradient is close to zero. For example, using the function $f(x)=\tanh(x)$ and the activation value $5$ we find:
$$f'(5)=\text{sech}^2(5)\approx0.00018$$

This value becomes smaller still as we backpropagate: $0.00018\times0.00018\times\cdots$ quickly produces a very small value, causing the model to stagnate in its ability to learn.
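The arithmetic above can be checked with a short sketch. It assumes a deliberately simplified picture in which every layer contributes the same local tanh gradient and the weights are ignored; the helper names `tanh_grad` and `chained_gradient` are made up for this illustration.

```python
import math


def tanh_grad(x: float) -> float:
    """Derivative of tanh(x), which equals sech^2(x) = 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2


def chained_gradient(x: float, num_layers: int) -> float:
    """Product of the same local gradient over num_layers layers.

    Simplifying assumption: every layer sees the same activation value
    and the weights are ignored, so only the activation gradient matters.
    """
    return tanh_grad(x) ** num_layers


if __name__ == "__main__":
    print(f"local gradient at x=5: {tanh_grad(5.0):.5f}")
    for depth in (1, 2, 3, 5):
        print(f"{depth} layer(s): {chained_gradient(5.0, depth):.3e}")
```

Even at three layers the product is already many orders of magnitude below the single-layer gradient, which is why deeper networks are more exposed to the problem.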

The vanishing gradient problem can be observed in a few ways. A common sign is the loss remaining the same, especially if this plateau occurs early in training. Another is to inspect the gradients or activation values during training, which can be done with tooling.

#### Addressing Vanishing Gradient
The vanishing gradient problem has various solutions, some of which are discussed later. For now we'll focus on the simpler approaches of ReLU and gradient clipping.

As the cause of the issue is the activation function, the most common approach to solving the problem is **changing the activation function to limit low gradients**. The best-known choice is the ReLU function, which has a gradient equal to 0 for $x<0$ and 1 for $x>0$. For positive inputs the gradient therefore does not shrink as it is multiplied through the layers.
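As a rough illustration of why this helps, the sketch below compares the local gradient of ReLU with that of tanh at the same positive input. The function names are hypothetical, and returning 0 for the gradient at exactly $x=0$ (where the derivative is undefined) is an assumed convention.

```python
import math


def relu(x: float) -> float:
    """ReLU activation: passes positive inputs through, zeroes the rest."""
    return max(0.0, x)


def relu_grad(x: float) -> float:
    """Gradient of ReLU: 1 for x > 0, 0 for x < 0.

    The derivative is undefined at x == 0; returning 0 there is an
    assumed convention for this sketch.
    """
    return 1.0 if x > 0 else 0.0


def tanh_grad(x: float) -> float:
    """Gradient of tanh, for comparison."""
    return 1.0 - math.tanh(x) ** 2


if __name__ == "__main__":
    depth = 10
    # Same simplification as before: identical local gradient per layer.
    print("ReLU chained gradient:", relu_grad(5.0) ** depth)
    print("tanh chained gradient:", tanh_grad(5.0) ** depth)
```

For a positive input the ReLU product stays at 1 regardless of depth, whereas the tanh product collapses towards zero.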

**Gradient clipping** is another technique, which works by restricting the magnitude of gradients during backpropagation. Because it only caps gradients that are too large, it chiefly addresses exploding gradients; it does not enlarge gradients that have already become small.
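One common form of clipping rescales the whole gradient vector when its norm exceeds a threshold. The following is a minimal sketch of that idea on a plain list of floats; `clip_by_norm` and `max_norm` are names chosen for this example, not the API of any particular library.

```python
import math
from typing import List


def clip_by_norm(grads: List[float], max_norm: float) -> List[float]:
    """Rescale grads so that their L2 norm is at most max_norm.

    Gradients whose norm is already within the limit are returned
    unchanged, so small gradients are not made any larger.
    """
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


if __name__ == "__main__":
    print(clip_by_norm([3.0, 4.0], max_norm=1.0))      # norm 5 is scaled down to norm 1
    print(clip_by_norm([0.001, 0.002], max_norm=1.0))  # already small: unchanged
```

Rescaling by the norm keeps the direction of the update and only limits its size. Clipping each component independently is another option, at the cost of changing the direction.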

### Exploding Gradient
The exploding gradient problem is similar to the vanishing gradient problem, except that the gradients grow arbitrarily large rather than arbitrarily small. Outside of using tooling, the problem can be observed through the loss oscillating, or through large (and NaN) activation values and weights. It can be mitigated with gradient clipping and by lowering the learning rate.
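The effect of the learning rate can be seen even in a toy setting. The sketch below runs gradient descent on a single weight with the loss $w^2$, which is an assumed stand-in for a real network; the function `descend` and its parameters are hypothetical.

```python
from typing import List


def descend(lr: float, steps: int = 5, w: float = 1.0) -> List[float]:
    """Run gradient descent on loss(w) = w**2 and return the loss after each step."""
    losses = []
    for _ in range(steps):
        grad = 2.0 * w       # derivative of w**2 with respect to w
        w = w - lr * grad    # standard gradient descent update
        losses.append(w * w)
    return losses


if __name__ == "__main__":
    print("lr=1.5:", descend(lr=1.5))  # weight flips sign and grows each step
    print("lr=0.1:", descend(lr=0.1))  # loss shrinks steadily
```

With `lr=1.5` each update maps $w$ to $-2w$, so the weight oscillates in sign while the loss quadruples every step. With `lr=0.1` each update maps $w$ to $0.8w$ and the loss decays.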


### Dying ReLU
Even when a ReLU function is used to address the vanishing gradient problem, we may still find the gradients becoming small and the loss value plateauing. This is caused by the related dying ReLU problem: a neuron with a negative input outputs 0 and has a gradient of 0, so when many activation values are negative those neurons stop updating. There are two main approaches to solving this:
- *Decreasing the learning rate*, which makes each weight update smaller and so results in fewer negative values after updating neuron weights.
- *Modified ReLU functions*, which usually solve the problem by removing the zero gradient for negative inputs. Some examples include leaky ReLU, parametric ReLU (PReLU), and the exponential linear unit (ELU).
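To make the second approach concrete, here is a minimal sketch of two of these variants and their gradients. The default `alpha` values (0.01 for leaky ReLU, 1.0 for ELU) are assumed, commonly seen choices rather than anything prescribed here. Parametric ReLU has the same form as leaky ReLU, except that `alpha` is learned during training.

```python
import math


def leaky_relu(x: float, alpha: float = 0.01) -> float:
    """Leaky ReLU: a small slope alpha replaces the flat region for x <= 0."""
    return x if x > 0 else alpha * x


def leaky_relu_grad(x: float, alpha: float = 0.01) -> float:
    """Gradient of leaky ReLU: 1 for x > 0, alpha otherwise (never exactly 0)."""
    return 1.0 if x > 0 else alpha


def elu(x: float, alpha: float = 1.0) -> float:
    """ELU: smooth exponential curve for x <= 0 that saturates at -alpha."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def elu_grad(x: float, alpha: float = 1.0) -> float:
    """Gradient of ELU: 1 for x > 0, alpha * exp(x) otherwise."""
    return 1.0 if x > 0 else alpha * math.exp(x)


if __name__ == "__main__":
    for x in (-3.0, -0.5, 2.0):
        print(x, leaky_relu_grad(x), elu_grad(x))
```

In both cases a neuron with a negative input still receives a non-zero gradient, so it can continue to update. Note that the ELU gradient still becomes very small for strongly negative inputs.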

## Batch Normalisation
Explain how it solves exploding/vanishing gradient.

## Weight Initialisation
Explain how it helps solve exploding/vanishing, but also explain its benefits in general.