### 2.4 The engine of neural networks: gradient-based optimization

As you saw in the previous section, each neural layer from our first network example transforms its input data as follows:
```
output = relu(dot(W, input) + b)
```
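To make the expression concrete, here is a minimal pure-Python sketch of one such layer applied to a single input vector. The helper names (`relu`, `dot`, `dense`) and the numeric values of `W`, `b`, and `x` are illustrative assumptions, not values taken from the network in the previous section.

```python
def relu(v):
    """Element-wise max(value, 0) over a list of floats."""
    return [max(value, 0.0) for value in v]


def dot(W, x):
    """Matrix-vector product; W is a list of rows, x is a list of floats."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]


def dense(W, b, x):
    """Compute relu(dot(W, x) + b) for one input vector x."""
    z = dot(W, x)
    return relu([z_i + b_i for z_i, b_i in zip(z, b)])


# Hypothetical 2x2 kernel, bias, and input, chosen only for illustration.
W = [[1.0, -2.0], [0.5, 0.5]]
b = [0.0, -1.0]
x = [1.0, 3.0]

output = dense(W, b, x)
print(output)  # [0.0, 1.0]: the first unit is clipped to zero by relu
```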

In this expression, `W` and `b` are the tensors that are attributes of the layer. They're called the *weights* or *trainable parameters* of the layer (the `kernel` and `bias` attributes, respectively). These weights contain the information learned by the network from exposure to training data.<br>
Initially, these weight matrices are filled with small random values (a step called *random initialization*). Of course, there's no reason to expect that `relu(dot(W, input) + b)`, when `W` and `b` are random, will yield any useful representations. The resulting representations are meaningless, but they're a starting point. What comes next is to gradually adjust these weights, based on a feedback signal. This gradual adjustment, also called *training*, is basically the learning that machine learning is all about.<br>
This happens within what's called a *training loop*, which works as follows. Repeat these steps in a loop, as long as necessary:<br>
1. Draw a batch of training samples `x` and corresponding targets `y`.
2. Run the network on `x` (a step called the *forward pass*) to obtain predictions `y_pred`.
3. Compute the loss of the network on the batch, a measure of the mismatch between `y_pred` and `y`.
4. Update all weights of the network in a way that slightly reduces the loss on this batch.<br>

You'll eventually end up with a network that has a very low loss on its training data: a low mismatch between predictions `y_pred` and expected targets `y`. The network has "learned" to map its inputs to correct targets. From afar, it may look like magic, but when you reduce it to elementary steps, it turns out to be simple.<br>
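The four steps of the training loop can be sketched on a deliberately tiny model. Everything here is an assumption made for illustration: the model is a single linear unit `w * x + b`, the data follow `y = 2 * x + 1`, the loss is mean squared error, and the update in the last step uses a rule derived by hand for this specific model. How to obtain such an update for a real network is the subject of the rest of this section.

```python
import random

random.seed(0)

# Hypothetical toy data: targets follow y = 2 * x + 1.
xs = [i / 10 for i in range(-10, 11)]
data = [(x, 2.0 * x + 1.0) for x in xs]

# Random initialization of the two trainable parameters.
w = random.uniform(-0.1, 0.1)
b = random.uniform(-0.1, 0.1)
learning_rate = 0.1  # assumed step size


def mean_squared_error(y_pred, y):
    return sum((p - t) ** 2 for p, t in zip(y_pred, y)) / len(y)


losses = []
for step in range(500):
    # Step 1: draw a batch of training samples and targets.
    batch = random.sample(data, 8)
    x_batch = [x for x, _ in batch]
    y_batch = [y for _, y in batch]

    # Step 2: forward pass.
    y_pred = [w * x + b for x in x_batch]

    # Step 3: loss of the model on this batch.
    losses.append(mean_squared_error(y_pred, y_batch))

    # Step 4: nudge the weights so the loss on this batch decreases slightly.
    # These two expressions are hand-derived for this linear model only.
    n = len(batch)
    grad_w = sum(2 * (p - t) * x for p, t, x in zip(y_pred, y_batch, x_batch)) / n
    grad_b = sum(2 * (p - t) for p, t in zip(y_pred, y_batch)) / n
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(w, b, losses[0], losses[-1])
```

After the loop, `w` and `b` should sit close to 2 and 1, and the last recorded loss should be far below the first.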

Step 1 sounds easy enough: it's just I/O code. Steps 2 and 3 are merely the application of a handful of tensor operations, so you could implement these steps purely from what you learned in the previous section. The difficult part is step 4: updating the network's weights. Given an individual weight coefficient in the network, how can you compute whether the coefficient should be increased or decreased, and by how much?

One naive solution would be to freeze all weights in the network except the one scalar coefficient being considered, and try different values for this coefficient. Let's say the initial value of the coefficient is 0.3. After the forward pass on a batch of data, the loss of the network on the batch is 0.5. If you change the coefficient's value to 0.35 and rerun the forward pass, the loss increases to 0.6. But if you lower the coefficient to 0.25, the loss falls to 0.4. In this case, it seems that updating the coefficient by -0.05 would contribute to minimizing the loss. This would have to be repeated for all coefficients in the network.
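A minimal sketch of this probing procedure, assuming a toy linear model with a mean-squared-error loss. The functions `forward_loss` and `probe_coefficient`, the batch, and the weight values are hypothetical and do not reproduce the 0.5 / 0.6 / 0.4 losses quoted above.

```python
def forward_loss(weights, batch):
    """Forward pass plus loss: mean squared error of a linear model."""
    total = 0.0
    for x, y in batch:
        y_pred = sum(w * x_i for w, x_i in zip(weights, x))
        total += (y_pred - y) ** 2
    return total / len(batch)


def probe_coefficient(weights, index, batch, delta=0.05):
    """Try weights[index] - delta and + delta with all other weights frozen.

    Returns the change (-delta, 0.0, or +delta) giving the lowest loss,
    together with that loss. Probing one coefficient costs two extra
    forward passes on top of the baseline pass.
    """
    best_change = 0.0
    best_loss = forward_loss(weights, batch)  # baseline forward pass
    for change in (-delta, +delta):
        trial = list(weights)  # copy, so the other coefficients stay frozen
        trial[index] += change
        trial_loss = forward_loss(trial, batch)
        if trial_loss < best_loss:
            best_change, best_loss = change, trial_loss
    return best_change, best_loss


# Hypothetical (input, target) pairs and initial weights.
batch = [((1.0, 2.0), 1.0), ((2.0, 0.5), 0.0)]
weights = [0.3, 0.4]

change, new_loss = probe_coefficient(weights, 0, batch)
print(change, new_loss)
```

With these values, lowering the first coefficient by 0.05 gives the lowest of the three losses, so `change` is `-0.05`.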

But such an approach would be horribly inefficient, because you'd need to compute two forward passes (which are expensive) for every individual coefficient (of which there are many, usually thousands and sometimes up to millions). A much better approach is to take advantage of the fact that all operations used in the network are *differentiable*, and compute the *gradient* of the loss with regard to the network's coefficients. You can then move the coefficients in the opposite direction from the gradient, thus decreasing the loss.

If you already know what *differentiable* means and what a *gradient* is, you can skip to section 2.4.3. Otherwise, the following two sections will help you understand these concepts.

#### 2.4.1 What's a derivative?

Consider a continuous, smooth function `f(x) = y`, mapping the real number `x` to a new real number `y`. Because the function is *continuous*, a small change in `x` can only result in a small change in `y`: that's the intuition behind continuity. Let's say you increase `x` by a small amount `epsilon_x`: this results in a small `epsilon_y` change to `y`:
```
f(x + epsilon_x) = y + epsilon_y
```
In addition, because the function is smooth (its curve doesn't have any abrupt angles), when `epsilon_x` is small enough, around a certain point `p`, it's possible to approximate `f` as a linear function of slope `a`, so that `epsilon_y` becomes `a * epsilon_x`:
```
f(x + epsilon_x) = y + a * epsilon_x
```
Obviously, this linear approximation is valid only when `x` is close enough to `p`. 
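A quick numerical check of this claim, using `f(x) = x ** 2` around `p = 3` as an assumed example. For this particular function the slope at `p` is `2 * p`, and the gap between the exact value and the linear approximation shrinks rapidly as `epsilon_x` gets smaller.

```python
def f(x):
    return x ** 2


p = 3.0
a = 2 * p  # slope of x ** 2 at p (a known result for this specific function)

errors = []
for epsilon_x in (1.0, 0.1, 0.01):
    exact = f(p + epsilon_x)
    linear = f(p) + a * epsilon_x  # y + a * epsilon_x
    errors.append(abs(exact - linear))
    print(epsilon_x, exact, linear, errors[-1])
```

For this function the gap equals `epsilon_x ** 2` up to rounding, so dividing `epsilon_x` by 10 divides the gap by roughly 100.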

The slope `a` is called the *derivative* of `f` in `p`. If `a` is negative, it means a small increase in `x` around `p` will result in a decrease of `f(x)`; and if `a` is positive, a small increase in `x` will result in an increase of `f(x)`. Further, the absolute value of `a` (the *magnitude* of the derivative) tells you how quickly this increase or decrease will happen.
<img src="./Images/Derivative_of_f_in_p.png" alt="Derivative of f in p" title="Derivative of f in p" style="width: 200px;"/>

For every differentiable function `f(x)` (*differentiable* means "can be derived": for example, smooth, continuous functions can be derived), there exists a derivative function `f'(x)` that maps values of `x` to the slope of the local linear approximation of `f` in those points. For instance, the derivative of `cos(x)` is `-sin(x)`, the derivative of `f(x) = a * x` is `f'(x) = a`, and so on.
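These two examples can be checked numerically. The sketch below estimates the slope with a central finite difference, which is an approximation technique assumed here for illustration and not something the text itself relies on. The helper name `numerical_derivative` is hypothetical.

```python
import math


def numerical_derivative(f, x, epsilon=1e-6):
    """Central-difference estimate of the slope of f at x."""
    return (f(x + epsilon) - f(x - epsilon)) / (2 * epsilon)


# Compare the estimated slope of cos with -sin at a few points.
for x in (0.0, 0.5, 2.0):
    print(x, numerical_derivative(math.cos, x), -math.sin(x))

# The slope of f(x) = a * x is a everywhere; here a = 3.
print(numerical_derivative(lambda x: 3.0 * x, 1.2))
```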

If you are trying to update `x` by a small amount `epsilon_x` in order to minimize `f(x)`, and you know the derivative of `f`, then your job is done: the derivative completely describes how `f(x)` evolves as you change `x`. If you want to reduce the value of `f(x)`, you just need to move `x` a little in the opposite direction from the derivative.
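As a sketch of this idea, take the assumed example `f(x) = (x - 2) ** 2 + 1`, whose derivative is `2 * (x - 2)` and whose minimum is at `x = 2`. Repeatedly moving `x` a little against the sign of the derivative lowers `f(x)` at every step. The step size of 0.1 is an arbitrary choice for this example.

```python
def f(x):
    return (x - 2.0) ** 2 + 1.0


def f_prime(x):
    return 2.0 * (x - 2.0)


x = 5.0
step = 0.1  # assumed small step size
values = [f(x)]
for _ in range(50):
    # Positive derivative: decrease x. Negative derivative: increase x.
    x -= step * f_prime(x)
    values.append(f(x))

print(x, values[0], values[-1])
```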

#### 2.4.2 Derivative of a tensor operation: the gradient

A *gradient* is the derivative of a tensor operation. It's the generalization of the concept of derivatives to functions of multidimensional inputs: that is, to functions that take tensors as inputs.

Consider an input vector `x`, a matrix `W`, a target `y`, and a loss function `loss`. You can use `W` to compute a target candidate `y_pred`, and compute the loss, or mismatch, between the target candidate `y_pred` and the target `y`:
```
y_pred = dot(W, x)
loss_value = loss(y_pred, y)
```
If the data inputs `x` and `y` are frozen, then this can be interpreted as a function mapping values of `W` to loss values:
```
loss_value = f(W)
```
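The sketch below spells out this view in pure Python. The choice of mean squared error for `loss`, the shapes, and all numeric values are assumptions made for illustration. The helper `numerical_gradient` is also hypothetical: it estimates, by finite differences, how the loss changes with each entry of `W`, which yields one number per entry and therefore a result with the same shape as `W`.

```python
def dot(W, x):
    """Matrix-vector product; W is a list of rows."""
    return [sum(w * x_j for w, x_j in zip(row, x)) for row in W]


def loss(y_pred, y):
    """Assumed loss: mean squared error."""
    return sum((p - t) ** 2 for p, t in zip(y_pred, y)) / len(y)


# Frozen data (hypothetical values).
x = [1.0, 2.0, 3.0]
y = [1.0, 0.0]


def f(W):
    """Map a 2x3 matrix W to a scalar loss value, with x and y held fixed."""
    return loss(dot(W, x), y)


def numerical_gradient(f, W, epsilon=1e-6):
    """Finite-difference slope of f with respect to each entry of W."""
    grad = [[0.0] * len(row) for row in W]
    for i, row in enumerate(W):
        for j in range(len(row)):
            W_plus = [list(r) for r in W]
            W_minus = [list(r) for r in W]
            W_plus[i][j] += epsilon
            W_minus[i][j] -= epsilon
            grad[i][j] = (f(W_plus) - f(W_minus)) / (2 * epsilon)
    return grad


W0 = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.2]]
loss_value = f(W0)
grad = numerical_gradient(f, W0)
print(loss_value)
print(grad)
```

Note that this estimate needs two evaluations of `f` per entry of `W`, which is exactly the cost the naive approach was criticized for. It is useful here only as a way to see what the numbers mean.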

Let's say the current value of `W` is `W0`. Then the derivative of `f` in the point `W0` is a tensor `gradient(f) (W0)` with the same shape as `W`, where each 