each neural layer transforms its input data as follows:

In [None]:
output = relu(dot(W, input) + b)

in this expression, 'W' and 'b' are tensors that are attributes of the layer; they're called the 'weights' or 'trainable parameters' of the layer (the 'kernel' and 'bias' attributes, respectively); these weights contain the information learned by the network from exposure to training data

initially, these weight matrices are filled with small random values (a step called random initialisation); the weights are then gradually adjusted, based on a feedback signal (training)
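The layer transformation and random initialisation described above can be sketched in NumPy. This is a minimal illustration, not a framework implementation: the helper names `relu` and `dense_forward`, the layer sizes, and the initialisation scale of 0.01 are all assumptions made for the example.

```python
import numpy as np

def relu(x):
    # Element-wise max(x, 0)
    return np.maximum(x, 0.0)

def dense_forward(W, b, inputs):
    # output = relu(dot(W, input) + b)
    return relu(np.dot(W, inputs) + b)

# Random initialisation: small random values for the kernel, zeros for the bias
# (shapes and scale are assumed for illustration).
rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(3, 4))
b = np.zeros(3)

x = np.ones(4)                  # a single input vector with 4 features
output = dense_forward(W, b, x)  # one value per unit, never negative
```

Training then consists of adjusting `W` and `b` so that `output` moves closer to the desired targets.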

training loop (repeat as long as necessary):
- draw a batch of training samples x and corresponding targets y
- run the network on x (a step called the 'forward pass') to obtain predictions 'y_pred'
- compute the loss of the network on the batch, a measure of the mismatch between 'y_pred' and y
- update all weights of the network in a way that slightly reduces the loss on this batch

eventually you end up with a network that has a very low loss on its training data: a low mismatch between predictions 'y_pred' and expected targets y; the network has learned to map its inputs to correct targets.

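updating the network's weights relies on derivatives and gradients: because every operation in the network is differentiable, the gradient of the loss with respect to the weights can be computed, which is far less expensive than trying out new values for each weight individually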

##### What's a Derivative

![Derivative](./derivative.png)

##### Derivative of a Tensor Operation: the Gradient

a gradient is the derivative of a tensor operation; it's the generalisation of the concept of derivatives to functions of multidimensional inputs: that is, to functions that take tensors as inputs.

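consider an input vector 'x', a matrix 'W', a target 'y', and a loss function 'loss'; 'W' is used to compute a prediction 'y_pred', and 'loss' measures the mismatch between 'y_pred' and 'y':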

In [None]:
y_pred = dot(W, x)
loss_value = loss(y_pred, y)

if the data inputs 'x' and 'y' are frozen, then this can be interpreted as a function mapping values of 'W' to loss values:

In [None]:
loss_value = f(W)

![Derivative of a Tensor Operation: the Gradient](./gradient.png)
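As a rough check of what "gradient of `f` at `W`" means, the sketch below compares a hand-derived gradient with a finite-difference estimate. It assumes a squared-error loss (the notes do not fix a particular `loss`), and the names `analytic_gradient` and `numerical_gradient` are made up for this example.

```python
import numpy as np

def f(W, x, y):
    # loss_value = f(W), with x and y frozen; squared error is an assumption
    y_pred = np.dot(W, x)
    return np.sum((y_pred - y) ** 2)

def analytic_gradient(W, x, y):
    # For the squared error above: d loss / d W = 2 * outer(y_pred - y, x)
    return 2.0 * np.outer(np.dot(W, x) - y, x)

def numerical_gradient(W, x, y, eps=1e-6):
    # Central finite differences, one coefficient of W at a time
    grad = np.zeros_like(W)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            W_plus, W_minus = W.copy(), W.copy()
            W_plus[i, j] += eps
            W_minus[i, j] -= eps
            grad[i, j] = (f(W_plus, x, y) - f(W_minus, x, y)) / (2 * eps)
    return grad

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, -1.0])
W = np.array([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])

grad = analytic_gradient(W, x, y)
# Moving W a little against the gradient should lower the loss
W_new = W - 0.01 * grad
```

The gradient has the same shape as `W`: each entry says how the loss changes when the matching coefficient is nudged.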

##### Stochastic Gradient Descent

given a differentiable function, it's theoretically possible to find its minimum analytically: the minimum is a point where the derivative is 0

applied to a neural network, that means finding analytically the combination of weight values that yields the smallest possible loss value, by solving the equation:

In [None]:
gradient(f)(W) = 0 for W

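This is a polynomial equation of N variables, where N is the number of coefficients in the network (often tens of millions); solving it analytically is intractable at that scale, so in practice the loss is reduced step by step with the iterative procedure below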
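For contrast, here is a case small enough to solve analytically. The sketch assumes a linear model with a squared-error loss and only two coefficients, so setting the gradient to zero gives a small linear system; the data, `true_w`, and the function name `loss_gradient` are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))      # 20 samples, 2 coefficients to find
true_w = np.array([2.0, -3.0])    # hypothetical "ground truth" weights
y = X @ true_w

def loss_gradient(w):
    # Gradient of sum((X @ w - y) ** 2) with respect to w
    return 2.0 * X.T @ (X @ w - y)

# gradient(f)(w) = 0  <=>  (X^T X) w = X^T y
w_star = np.linalg.solve(X.T @ X, X.T @ y)
```

This closed-form route only works here because the model is linear and tiny; it does not carry over to a deep network.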

mini-batch stochastic gradient descent (mini-batch SGD):
- draw a batch of training samples [x] and corresponding targets [y]
- run the network on [x] to obtain predictions [y_pred]
- compute the loss of the network on the batch, a measure of the mismatch between [y_pred] and [y]
- compute the gradient of the loss with regard to the network's parameters (a backward pass)
- move the parameters a little in the opposite direction from the gradient, for example [W -= step * gradient], thus reducing the loss on the batch a bit
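The five steps above can be sketched for a very simple "network": a linear model trained with mean squared error. Everything specific here is an assumption for illustration (the synthetic data, `true_w`, the batch size of 32, the step of 0.1, and the helper `mse`).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 3))             # training samples
true_w = np.array([1.0, -2.0, 0.5])       # hypothetical target weights
y = X @ true_w                            # corresponding targets

def mse(w, xb, yb):
    return np.mean((xb @ w - yb) ** 2)

w = np.zeros(3)
step = 0.1
batch_size = 32
initial_loss = mse(w, X, y)

for _ in range(200):
    # 1. draw a random batch of samples and targets
    idx = rng.choice(len(X), size=batch_size, replace=False)
    xb, yb = X[idx], y[idx]
    # 2. forward pass
    y_pred = xb @ w
    # 3. loss on the batch
    loss = np.mean((y_pred - yb) ** 2)
    # 4. gradient of the batch loss with regard to w (backward pass)
    gradient = 2.0 * xb.T @ (y_pred - yb) / batch_size
    # 5. move w a little against the gradient
    w -= step * gradient

final_loss = mse(w, X, y)
```

After the loop, `final_loss` should be far below `initial_loss`, and `w` should sit close to `true_w`.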

the term 'stochastic' refers to the fact that each batch of data is drawn at random ('stochastic' is a scientific synonym of 'random')

![sgd](./sgd.png)

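intuitively, it's important to pick a reasonable value for the 'step' factor: if it's too small, the descent down the curve takes many iterations and can get stuck in a local minimum; if it's too large, the updates can overshoot and end up at essentially random locations on the curve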
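The effect of the 'step' factor is easy to see on a toy loss. The sketch below assumes the loss is simply `w ** 2` (gradient `2 * w`) and starts from `w = 1.0`; the function name `descend` and the three step values are chosen only for illustration.

```python
def descend(step, n_steps=20, w=1.0):
    # Plain gradient descent on loss(w) = w ** 2
    for _ in range(n_steps):
        gradient = 2.0 * w
        w -= step * gradient
    return w

w_too_small = descend(step=0.001)  # barely moves away from the start
w_reasonable = descend(step=0.1)   # gets close to the minimum at 0
w_too_large = descend(step=1.1)    # overshoots further on every update
```

With the same number of updates, only the middle setting ends up near the minimum.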

Alternatively:
- true SGD (as opposed to mini-batch SGD):
    - draw a single sample and target at each iteration, rather than drawing a batch of data
- batch SGD (the opposite extreme):
    - run every step on all data available

with batch SGD, each update is more accurate but far more expensive; the efficient compromise between these two extremes is to use mini-batches of reasonable size

![Gradient Descent](./gradient_descent.png)

optimisation methods or optimizers:
- there exist multiple variants of SGD that take previous weight updates into account when computing the next weight update, rather than just looking at the current value of the gradient
- for instance, SGD with momentum, Adagrad, RMSProp and several others
- momentum addresses two issues with SGD: convergence speed and local minima

![Local Global Minimum](./local_global.png)

around a certain parameter value there is a local minimum: from that point, moving left or moving right would both increase the loss; if the parameter under consideration were being optimised via SGD with a small learning rate, the optimisation process would get stuck at the local minimum instead of making its way to the global minimum

momentum draws inspiration from physics: picture the optimisation process as a small ball rolling down the loss curve; if it has enough momentum, the ball won't get stuck in a ravine and will end up at the global minimum.

momentum is implemented by moving the ball at each step based not only on the current slope value (current acceleration) but also on the current velocity (resulting from past acceleration); in practice, this means updating the parameter w based not only on the current gradient value but also on the previous parameter update, as in this naive implementation:

In [None]:
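learning_rate = 0.01  # assumed value; not specified in these notes
past_velocity = 0.
momentum = 0.1  # constant momentum factor
loss = float('inf')  # ensures the loop body runs at least once
# get_current_parameters() and update_parameter() are placeholders, not real APIs
while loss > 0.01:  # optimisation loop
    w, loss, gradient = get_current_parameters()
    # the velocity accumulates past updates and points against the gradient
    velocity = past_velocity * momentum - learning_rate * gradient
    w = w + momentum * velocity - learning_rate * gradient
    past_velocity = velocity
    update_parameter(w)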
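A self-contained variant of this loop can be run on a toy problem. The sketch assumes the loss is `w ** 2` (gradient `2 * w`), lets the velocity accumulate negative gradient steps, and uses a momentum of 0.9 rather than 0.1 so the effect is visible. It only illustrates the convergence-speed benefit, not the escape from local minima; the function name `run` and all constants are assumptions.

```python
def run(momentum, learning_rate=0.01, w=5.0, max_steps=10_000):
    # Returns how many updates are needed before loss(w) = w ** 2 <= 0.01
    past_velocity = 0.0
    steps = 0
    while w ** 2 > 0.01 and steps < max_steps:
        gradient = 2.0 * w
        # velocity blends the previous update with the current downhill step
        velocity = past_velocity * momentum - learning_rate * gradient
        w = w + momentum * velocity - learning_rate * gradient
        past_velocity = velocity
        steps += 1
    return steps

steps_plain = run(momentum=0.0)      # reduces to plain gradient descent
steps_momentum = run(momentum=0.9)   # same learning rate, with momentum
```

With `momentum=0.0` the update collapses to `w -= learning_rate * gradient`, so the two runs differ only in the momentum term, and the second should need noticeably fewer updates.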

##### Chaining Derivatives: the Backpropagation Algorithm

in practice, a neural network function consists of many tensor operations chained together, each of which has a simple, known derivative; writing out the derivative of the whole chain by hand is impractical

for instance, a network [f] composed of three tensor operations, [a], [b] and [c], with weight matrices [W1], [W2] and [W3]:

In [None]:
f(W1, W2, W3) = a(W1, b(W2, c(W3)))

calculus tells us that such a chain of functions can be differentiated using the 'chain rule': (f(g(x)))' = f'(g(x)) * g'(x); applying the chain rule to the computation of the gradient values of a neural network gives rise to an algorithm called 'Backpropagation' (also sometimes called 'reverse-mode differentiation')

Backpropagation starts with the final loss value and works backward from the top layers to the bottom layers, applying the chain rule to compute the contribution that each parameter had in the loss value
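To make the backward sweep concrete, here is a hand-written sketch for a chain of two tensor operations with a relu in between and a squared-error loss. The architecture, the loss, the numeric values, and the names `forward` and `backward` are all assumptions for the example; real frameworks generate this code for you.

```python
import numpy as np

def forward(W1, W2, x, y):
    z = W1 @ x                      # first tensor operation
    h = np.maximum(z, 0.0)          # relu
    y_pred = W2 @ h                 # second tensor operation
    loss = np.sum((y_pred - y) ** 2)
    return z, h, y_pred, loss

def backward(W1, W2, x, y):
    z, h, y_pred, _ = forward(W1, W2, x, y)
    # Start from the loss and apply the chain rule layer by layer
    d_y_pred = 2.0 * (y_pred - y)        # d loss / d y_pred
    grad_W2 = np.outer(d_y_pred, h)      # d loss / d W2
    d_h = W2.T @ d_y_pred                # pass the signal to the layer below
    d_z = d_h * (z > 0)                  # relu lets gradient through where z > 0
    grad_W1 = np.outer(d_z, x)           # d loss / d W1
    return grad_W1, grad_W2

x = np.array([1.0, 2.0])
y = np.array([0.5])
W1 = np.array([[0.5, -0.3], [0.2, 0.8], [-0.6, 0.1]])
W2 = np.array([[1.0, -1.0, 0.5]])

grad_W1, grad_W2 = backward(W1, W2, x, y)
```

Only the second hidden unit is active for this input, so only the second row of `grad_W1` and the second entry of `grad_W2` are non-zero: the other parameters made no contribution to the loss.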

Nowadays, and for years to come, people will implement networks in modern frameworks that are capable of 'symbolic differentiation', such as TensorFlow; this means that, given a chain of operations with known derivatives, they can compute a gradient function for the chain (by applying the chain rule) that maps network parameter values to gradient values.

when you have access to such a function, the backward pass is reduced to a call to this gradient function; thanks to symbolic differentiation, you will never have to implement the backpropagation algorithm by hand; all you need is a good understanding of how gradient-based optimisation works.