# Gradient-Based Optimisation

## Introduction

>Previously, we learnt how to optimise the parameters of linear regression using its analytical solutions. However, this technique has some fundamental drawbacks:

- It is available only for special cases of ML methods, i.e. __it cannot be generalised.__
- Its computation is not feasible for large datasets and cases involving many features.

In this notebook, we learn how to use **gradient-based optimisation** as another technique to find model parameterisations that perform well.

Gradient-based optimisation is not exclusively used in ML. It is a technique used for optimising all kinds of functions in all kinds of domains.

> We will employ gradient-based optimisation to minimise the criteria of our ML algorithms.

To improve your understanding, a depiction of gradient-based optimisation is provided in the diagram below. In our case, $f(w)$ represents the criterion that we intend to minimise, which varies with the model parameters ($w$ & $b$ for linear regression):

![](images/gradient_descent_intuition.jpg)

## Understanding Gradient-Based Optimisation

Our loss is simply a mathematical function that depends on the parameters of our model (for example, we used the mean-squared error (MSE) loss function in the previous notebooks).
Our objective is to move the parameters to the point where this loss is minimised.

> If we evaluated the loss value for every possible different parameterisation of our model, we would produce a **loss surface**. 

The next step would involve identifying the lowest point on this surface. 
- At this point, the loss has a gradient (steepness) of zero with respect to the parameters.
- As the parameters move away from that minimum in any direction, the gradient becomes non-zero and points uphill, away from the minimum.

To return to the minimum, we would __move our weights in the opposite direction to the gradient__ (simply subtract it).

![](./images/grad-based-optim.jpg)

## Numerical Example

Below is an example showing the shift direction for parameter $W$, initialised as $W=4$, for a surface given by

$$
L=(W-2)^2
$$ 

At this point on the surface, the loss gradient with respect to this parameter is positive ($\frac{dL}{dW} = 2(W-2) = 4$); therefore, we should shift it in the negative direction to move it closer to the optimum.

![](images/sgd_numerical_example.jpg)

Below is a more complex potential loss surface with more than one parameter (the vertical axis represents the loss value, while others represent the parameter values). In reality, we will often have many more features, and we will not be able to visualise the loss surface since this would require more than three dimensions.

<img style="height: 200px" src='./images/comp-loss-surface.png'/>

> **Note: Since gradient-based optimisation depends on us computing the gradient of the loss function, our loss function and model must be fully differentiable (i.e. they must be a smooth, continuous function).**

## Gradient Descent

Gradient descent is an iterative, gradient-based optimisation technique. 

In other words, it is a technique for finding a minimum of a function, and it does so by iteratively moving the parameters downhill in the opposite direction to the surface gradient (moving along the gradient instead would lead towards a maximum).

![](images/gradient_descent_intuition.jpg)

## The learning rate ($\alpha$)

To update the parameters, we shift them in the opposite direction to the gradient. However, by what degree should they be shifted?

> The learning rate, $\alpha$ (often abbreviated as `lr` in the source code), __multiplies the gradient__ to decrease (usually) or increase its magnitude.

Thus, the `step_size` is the `gradient` multiplied by the `lr`.
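
As a minimal sketch of this update rule, the loop below applies it to the surface $L=(W-2)^2$ from the numerical example, starting at $W=4$. The learning rate of `0.1` and the 50 iterations are assumed values chosen only for illustration.

```python
def loss_gradient(w):
    """Derivative of L = (w - 2)**2 with respect to w."""
    return 2 * (w - 2)

w = 4.0   # initial parameter value from the numerical example
lr = 0.1  # learning rate (assumed value for illustration)

history = [w]
for _ in range(50):
    gradient = loss_gradient(w)
    step_size = lr * gradient  # the step is the gradient scaled by the learning rate
    w = w - step_size          # move in the opposite direction to the gradient
    history.append(w)

print("After one step:", history[1])
print("After 50 steps:", w)
```

The first step moves the parameter from `4` towards the minimum at `2`, and later steps shrink as the gradient gets smaller.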

### Things to note

#### Low `lr`

If the `lr` is too low, we may 
- __converge very slowly__.
- be unable to escape local minima or saddle points (we will go over that shortly).

![title](images/low-lr.jpg)

#### High `lr`

If the `lr` is too high, we may 
- overshoot the minimum.
- diverge rather than __converge__.

![title](images/high-lr.jpg)

Therefore, we include the `lr` to scale up/down the steps. 

> Note, however, that the `lr` should mostly be less than 1.

Experiment with the learning rate, and adjust it until your model converges.

![title](images/convergence.jpg)
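
The effect of the learning rate can be tried out on a toy problem. The sketch below reuses the surface $L=(w-2)^2$; the function name `run_gradient_descent`, the three learning rates and the 20 steps are assumptions made for this illustration, not values recommended for real models.

```python
def run_gradient_descent(lr, w=4.0, steps=20):
    """Minimise L = (w - 2)**2 starting from w and return the final parameter value."""
    for _ in range(steps):
        gradient = 2 * (w - 2)
        w = w - lr * gradient
    return w

for lr in (0.001, 0.1, 1.1):
    final_w = run_gradient_descent(lr)
    print(f"lr={lr}: final w={final_w:.4f}, distance from minimum={abs(final_w - 2):.4f}")
```

With the lowest learning rate the parameter barely moves in 20 steps, with the middle one it gets close to the minimum, and with the highest one each step overshoots by more than the last, so the parameter moves further and further away.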

## Local Optima

If we are attempting to minimise a function with respect to only 1 or 2 parameters, gradient descent may get stuck in local optima.

However, most of the useful models in practice depend on many more parameters (neural networks can easily have millions).

> As the number of parameters increases, it becomes exponentially less likely that a point with zero gradient is a true minimum rather than a saddle point; at a saddle point, there is still a direction in which the loss can be improved.

Furthermore, __in practice, we often do not need to find the global optimum.__
A local optimum can be good enough to realise the required performance.

Moreover, we can attempt to counter the inability to move from the local optima using different optimisation algorithms, such as [gradient descent with (Nesterov) momentum](https://distill.pub/2017/momentum/).

### Gradient descent

The diagrams shown above visualise how a single parameter affects the loss. 

A model with multiple parameters (such as a weight and a bias, or multiple weights) would be optimised in the same way; there would simply be more of these functions. 

> Think of each of the graphs as a cross-section through a **loss surface**. 

A loss surface is shown below, which visualises the variation in a model's criterion as a function of both parameters.

$$
L = w_1^4 + w_2^2
$$

<img style="height: 300px" src='images/x2x4.png'/>

![](images/multivariate_sgd.jpg)

If we know the function from which the loss is computed and it is differentiable, we can calculate the derivative of the loss with respect to our model parameters by hand or using an automatic differentiation graph (we will cover this in Deep Learning).

Thereafter, we can iteratively move each parameter in the direction opposite to its gradient.
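
As a rough sketch, the same procedure can be applied to the two-parameter surface $L = w_1^4 + w_2^2$ shown above, with the gradient worked out by hand. The starting point, learning rate and number of iterations are assumed values for illustration.

```python
import numpy as np

def loss(w):
    """The example surface L = w_1**4 + w_2**2."""
    return w[0] ** 4 + w[1] ** 2

def gradient(w):
    """Partial derivatives of the loss with respect to w_1 and w_2."""
    return np.array([4 * w[0] ** 3, 2 * w[1]])

start = np.array([1.0, 1.0])  # assumed starting point
lr = 0.1                      # assumed learning rate

w = start.copy()
for _ in range(200):
    w = w - lr * gradient(w)  # move each parameter against its own partial derivative

print("Final parameters:", w)
print("Loss at start:", loss(start), "Loss at end:", loss(w))
```

Both parameters are updated with the same rule; `w_2` approaches zero much faster than `w_1` because the surface is flatter along `w_1` near the minimum.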

### A helper function

We will use this code shortly to visualise the training progress.

In [None]:
import matplotlib.pyplot as plt
def plot_loss(losses):
    """Helper function for plotting the loss against the epoch"""
    plt.figure() # make a figure
    plt.ylabel('Cost')
    plt.xlabel('Epoch')
    plt.plot(losses) # plot costs
    plt.show()

### The data

Run the cells below to obtain the data and plot it.

In [1]:
!pip install aicore

Collecting aicore
  Downloading aicore-0.0.3-py3-none-any.whl (11 kB)
Collecting scikit-learn>=0.23.2
  Downloading scikit_learn-1.0.1-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (24.7 MB)
Collecting threadpoolctl>=2.0.0
  Downloading threadpoolctl-3.0.0-py3-none-any.whl (14 kB)
Collecting joblib>=0.11
  Downloading joblib-1.1.0-py2.py3-none-any.whl (306 kB)
Installing collected packages: threadpoolctl, joblib, scikit-learn, aicore
Successfully installed aicore-0.0.3 joblib-1.1.0 scikit-learn-1.0.1 threadpoolctl-3.0.0


In [5]:
from sklearn import datasets, model_selection
from aicore.ml import data
import pandas as pd
import numpy as np

# Use `data.split` to split the data into training, validation, and test sets.
(X_train, y_train), (X_validation, y_validation), (X_test, y_test) = data.split(
    datasets.load_boston(return_X_y=True)
)
X_train, X_validation, X_test = data.standardize_multiple(X_train, X_validation, X_test)



    The Boston housing prices dataset has an ethical problem. You can refer to
    the documentation of this function for further details.

    The scikit-learn maintainers therefore strongly discourage the use of this
    dataset unless the purpose of the code is to study and educate about
    ethical issues in data science and machine learning.

    In this special case, you can fetch the dataset from the original
    source::

        import pandas as pd
        import numpy as np


        data_url = "http://lib.stat.cmu.edu/datasets/boston"
        raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
        data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
        target = raw_df.values[1::2, 2]

    Alternative datasets include the California housing dataset (i.e.
    :func:`~sklearn.datasets.fetch_california_housing`) and the Ames housing
    dataset. You can load the datasets as follows::

        from sklearn.datasets import fetch_california_h

### The model

Here is the same model we implemented previously.

In [None]:
class LinearRegression:
    def __init__(self, optimiser, n_features): # initialise parameters 
        self.w = np.random.randn(n_features) ## randomly initialise the weight
        self.b = np.random.randn() ## randomly initialise the bias
        self.optimiser = optimiser
        
    def predict(self, X): # how do we calculate the output from an input in our model?
        ypred = X @ self.w + self.b ## make a prediction using a linear hypothesis
        return ypred # return prediction

    def fit(self, X, y):
        all_costs = [] ## initialise an empty list of costs to plot later
        for epoch in range(self.optimiser.epochs): ## for this many complete runs through the dataset

            # MAKE PREDICTIONS AND UPDATE MODEL
            predictions = self.predict(X) ## make predictions
            new_w, new_b = self.optimiser.step(self.w, self.b, X, predictions, y) ## calculate updated params
            self._update_params(new_w, new_b) ## update the model weight and bias
            
            # CALCULATE THE LOSS FOR VISUALISATION
            cost = mse_loss(predictions, y) ## compute the loss 
            all_costs.append(cost) ## add the cost for this batch of examples to the list of costs (for plotting)

        plot_loss(all_costs)
        print('Final cost:', cost)
        print('Weight values:', self.w)
        print('Bias values:', self.b)

    
    def _update_params(self, new_w, new_b):
        self.w = new_w ## set this instance's weights to the new weight value passed to the function
        self.b = new_b ## do the same for the bias

### The criterion

Recall the formula:

$$
\begin{equation}
    L_{mse} = \frac{1}{N}\sum_{i=1}^{N}(\hat{y}_i - y_i)^2
\end{equation}
$$

In [None]:
def mse_loss(y_hat, labels): # define the criterion (loss function)
    errors = y_hat - labels ## calculate the errors
    squared_errors = errors ** 2 ## square the errors
    mean_squared_error = sum(squared_errors) / len(squared_errors) ## calculate the mean 
    return mean_squared_error # return the loss

### The optimiser: gradient descent

With linear regression, it is possible to swap out different optimisers and use the same model, data and criterion.

#### Implementing gradient descent from scratch

Below is a derivation for computing the rate of change (gradient) in the loss with respect to our model's parameters when using a linear model and the MSE loss function.
![title](images/NN1_single_grad_calc.jpg)
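
For reference, here is a sketch of the result in equation form. It assumes the linear hypothesis $\hat{y}_i = w^\top x_i + b$ and the MSE criterion defined above, where $x_i$ denotes the feature vector of example $i$ (notation introduced here for illustration). Applying the chain rule to each squared error gives

$$
\frac{\partial L_{mse}}{\partial w} = \frac{2}{N}\sum_{i=1}^{N}(\hat{y}_i - y_i)\,x_i,
\qquad
\frac{\partial L_{mse}}{\partial b} = \frac{2}{N}\sum_{i=1}^{N}(\hat{y}_i - y_i)
$$

The derivative with respect to $w$ has one entry per feature, so it has the same shape as $w$.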

Complete the class below to return the derivative of the loss w.r.t the weight and bias by implementing the above equations in code.

In [None]:
import numpy as np

class SGDOptimiser:
    def __init__(self, lr, epochs):
        self.lr = lr
        self.epochs = epochs

    def _calc_deriv(self, features, predictions, labels):
        m = len(labels) ## m = number of examples
        diffs = predictions - labels ## calculate the errors
        dLdw = 2 * np.sum(features.T * diffs, axis=1) / m ## calculate the loss derivative with respect to each weight (sum over examples only, giving one value per feature)
        dLdb = 2 * np.sum(diffs) / m ## calculate the loss derivative with respect to the bias
        return dLdw, dLdb ## return the rate of change in the loss with respect to w and b, separately.

    def step(self, w, b, features, predictions, labels):
        dLdw, dLdb = self._calc_deriv(features, predictions, labels)
        new_w = w - self.lr * dLdw
        new_b = b - self.lr * dLdb
        return new_w, new_b
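
One way to gain confidence in hand-derived gradients is to compare them against finite differences. The sketch below does this on small synthetic data, assuming a linear model with the MSE loss; the data, the parameter values and the step `eps` are all made up for this check.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))  # synthetic features: 20 examples, 3 features
y = rng.normal(size=20)       # synthetic labels
w = rng.normal(size=3)        # arbitrary weights
b = 0.5                       # arbitrary bias

def mse(w, b):
    """Mean-squared error of the linear model on the synthetic data."""
    return np.mean((X @ w + b - y) ** 2)

# Analytical gradients of the MSE for a linear model
diffs = X @ w + b - y
dLdw = 2 * X.T @ diffs / len(y)
dLdb = 2 * np.sum(diffs) / len(y)

# Central finite-difference approximations of the same gradients
eps = 1e-6
basis = np.eye(3)
numeric_dLdw = np.array(
    [(mse(w + eps * basis[j], b) - mse(w - eps * basis[j], b)) / (2 * eps) for j in range(3)]
)
numeric_dLdb = (mse(w, b + eps) - mse(w, b - eps)) / (2 * eps)

print("Weights gradient matches:", np.allclose(dLdw, numeric_dLdw, atol=1e-5))
print("Bias gradient matches:", np.isclose(dLdb, numeric_dLdb, atol=1e-5))
```

Note that the weight gradient has one entry per feature; if it collapses to a single number, the sum has been taken over the wrong axes.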
    

### The combination



In [None]:
num_epochs = 1000
learning_rate = 0.001

optimiser = SGDOptimiser(lr=learning_rate, epochs=num_epochs)
model = LinearRegression(optimiser=optimiser, n_features=X_train.shape[1])
model.fit(X_train, y_train)

## `sklearn` example

`sklearn` packs all the steps above into a simple [LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) API.

In [None]:
from sklearn.linear_model import LinearRegression

linear_regression_model = LinearRegression() ## instantiate the linear-regression model

In [None]:
def calculate_loss(model, X, y):
    return mse_loss(
        model.predict(X),
        y
    )

print(f"Training loss before fit: {calculate_loss(model, X_train, y_train)}")
print(
    f"Validation loss before fit: {calculate_loss(model, X_validation, y_validation)}"
)
print(f"Test loss before fit: {calculate_loss(model, X_test, y_test)}")

In [None]:
model = linear_regression_model.fit(X_train, y_train) ## fit the model

Now, we fit the `sklearn` model again and observe the loss on the training, validation and test sets. Note that `sklearn`'s `LinearRegression` solves the least-squares problem directly rather than by gradient descent, so there is no number of epochs to set:

In [None]:
model.fit(X_train, y_train)

print(f"Training loss after fit: {calculate_loss(model, X_train, y_train)}")
print(f"Validation loss after fit: {calculate_loss(model, X_validation, y_validation)}")
print(f"Test loss after fit: {calculate_loss(model, X_test, y_test)}")

Additionally, we can examine the final parameters of the model when using sklearn:

In [None]:
print('final weights:', model.coef_)
print('final bias:', model.intercept_)

## Benefits of Gradient-Based Optimisation

> Gradient-based optimisation uses the gradient as a __heuristic__ that indicates how to change the parameters to improve the loss (e.g. how to minimise it).

There are other available options, such as analytical solutions; however, they have noticeable shortcomings.

We could also search for parameters __randomly__; however,

- our search region may not contain an optimal parameterisation for our model. For example, if we only searched bias values in `[-10, 10]` and the optimal bias lay outside that range, we would never obtain the solution. 
- we will experience an exponential increase in runtime with each additional parameter.
- there will be no feedback from the process (the gradient is our feedback here).

> The question of what to do when the data do not fit into the memory remains unanswered.

## Passing the Whole Dataset Through Each Prediction

We are aware that to perform gradient-based optimisation, inputs must be passed through the model (forward pass), following which the loss is computed and its variation with respect to the model's parameters investigated (backward pass).

Modern datasets can be extremely large. This implies that the forward pass can be time-intensive, since the function represented by the model has to be applied to every input in the dataset.

> Passing the full dataset through the model at each pass is called **full-batch gradient descent**.

### Disadvantages

- Using the whole dataset for every update removes the noise from the gradient estimate, which may hinder generalisation and lead to overfitting.
- It is quite slow (relatively slow memory access, more cache misses, etc.).


## Using a Single Datapoint for Each Update

If we pass a single example to our network and backpropagate based on that, the following will probably occur:

- the `gradient` will vary __significantly__ from one update to the next (a single example is usually uninformative for the whole task).
- the `outliers` (special data points, which could as well be noise and are completely non-representative of the task) will have a considerable effect on the updates.

The approach involving passing single examples through the model at each pass is called **online stochastic gradient descent**.

If, for some reason (memory constraints, or examples arriving as a stream), we cannot pass the whole dataset through the model at once, mini-batch gradient descent is usually a better solution than single-example updates.

## Mini-Batch Gradient Descent

The modern technique for conducting training involves neither the whole dataset nor the single datapoint (fully stochastic). 

Instead, we use mini-batch training:

> Sample several (usually `64-2048`, depending on the memory) datapoints to compute a sample of the gradient.

Most optimisation algorithms converge considerably faster if they are allowed to rapidly compute approximate gradients rather than slowly compute exact gradients. 

The size of the mini-batch is called the **batch size**, and this technique is currently the most widely used, particularly in neural networks.

### Advantages

- The size can be adjusted to fit the memory on most machines.
- It is fast (parallelisation is easy to realise, particularly for batch sizes that are powers of `2`).
- It improves generalisation as each batch is slightly noisy.

## Data Shuffling

Data shuffling is particularly important for __large and highly complex models__ (neural networks). If this is not carried out, we might risk the following:

- The same updates may be provided to the model at each batch. Since we intend to estimate the total average, the batches must be different.
- The model 'memorises' the batch (this occurs in neural networks).
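
Mini-batch training with reshuffling at every epoch can be sketched as follows for a linear model with the MSE loss. The function name `minibatch_gradient_descent`, the synthetic data and all hyperparameter values are assumptions made for this illustration.

```python
import numpy as np

def minibatch_gradient_descent(X, y, lr=0.05, epochs=50, batch_size=32, seed=0):
    """Fit a linear model with mini-batch gradient descent on the MSE loss."""
    rng = np.random.default_rng(seed)
    n_examples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(epochs):
        order = rng.permutation(n_examples)  # reshuffle so batches differ between epochs
        for start in range(0, n_examples, batch_size):
            idx = order[start:start + batch_size]
            X_batch, y_batch = X[idx], y[idx]
            diffs = X_batch @ w + b - y_batch  # errors on this batch only
            w = w - lr * 2 * X_batch.T @ diffs / len(idx)
            b = b - lr * 2 * np.sum(diffs) / len(idx)
    return w, b

# Synthetic data generated from known parameters, so the fit can be checked
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(500, 3))
true_w, true_b = np.array([1.5, -2.0, 0.5]), 0.7
y = X @ true_w + true_b + 0.01 * data_rng.normal(size=500)

w, b = minibatch_gradient_descent(X, y)
print("Estimated weights:", w)
print("Estimated bias:", b)
```

Each update uses the gradient of one batch as an approximation of the gradient over the whole dataset, and the permutation ensures that the batches are composed differently in every epoch.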

## Poor Conditioning

Different features in different datasets can have different ranges.
- Some features can be binary or in the `[0, 1]` range.
- Others have values in the hundreds or even thousands.

This is problematic for most ML models because when a small change in the weight connected to features with large values occurs, the output changes significantly. This increases the influence of the weight, and consequently the feature, simply because it is larger. 

As a result, the loss surface will appear as shown in the image below: steep in one direction and shallow in another.

![](images/unnormalised.png)

In the steep direction, the learning rate will have to be sufficiently low to prevent the optimisation from diverging. However, because the gradient signal in the shallow direction is very weak, little or no progress will be made in that direction.

> If the features are on different orders of magnitude, the maximum learning rate that works will be too small to make progress in every direction.

### Example

Consider a case with two weights, `a` and `b`, and a single example, `x`, with two features:

- the first has a value `0.1`, whereas the second has a value of `1000`.

Now, the formula for the linear regression would be

$$
    \hat{y} = 0.1a + 1000b
$$

Now, we investigate the impact of `a` and `b` on $\hat{y}$:
- $a = 10, b = 0.001$ - `a` and `b` have the same impact on $\hat{y}$.
- $a = 1, b = 1$ - `b` has `10000` times (!!!) more impact on $\hat{y}$.

It is unlikely that `a` has `10000` times less impact on the value we intend to predict (this is also unlikely in the real world).

> We should assume that all variables are __equally important, unless we verify otherwise__ via statistical testing or other measures.

The range of the values __is not as important a factor__ as the relative differences between the values.
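
The same imbalance shows up in the gradients. The sketch below computes the squared-error gradient with respect to `a` and `b` for the single example above; the target value `y` is a hypothetical number chosen only so that the error is non-zero.

```python
x = (0.1, 1000.0)  # the two feature values from the example
y = 5.0            # hypothetical target value
a, b = 1.0, 1.0    # current weights

y_hat = a * x[0] + b * x[1]  # linear prediction
error = y_hat - y

# Gradient of the squared error (y_hat - y)**2 with respect to each weight
dL_da = 2 * error * x[0]
dL_db = 2 * error * x[1]

print("dL/da:", dL_da)
print("dL/db:", dL_db)
print("ratio:", dL_db / dL_da)
```

The ratio of the two gradients equals the ratio of the feature values, so a learning rate small enough to keep `b` stable leaves `a` almost unchanged.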

#### Solution 1

One idea would be to use a different learning rate for each weight. However, we will have to search for the correct learning rate as many times as the number of parameters available. In many cases, examples can have a large number of features (in images, each pixel is a feature).

#### Solution 2: normalisation or standardisation

As a better solution, the data can be normalised.

> Normalisation is the process of bringing features to the same value range.

This ensures that the relative differences between the values for each feature are prioritised, not the scale.

> It is good practice to always normalise features, unless they are not continuous.

## Normalisation and Standardisation

There are numerous schemes to put values in the `[0, 1]` range. Here, we will employ the `minmax` approach.
We can do this by subtracting the minimum and subsequently dividing by the range (feature normalisation).

![title](images/normalisation.jpg)

Alternatively, we can use a similar method called standardisation, where we subtract the mean and subsequently divide by the standard deviation.

![](images/standardisation.jpg)
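
A minimal sketch of the min-max approach described above is given below. The function name `min_max_normalise` and the small feature matrix are made up for illustration, and the sketch assumes that every feature has a non-zero range.

```python
import numpy as np

def min_max_normalise(dataset, minimum=None, maximum=None):
    """Scale each feature (column) to the [0, 1] range of the data the statistics came from."""
    if minimum is None and maximum is None:
        minimum, maximum = np.min(dataset, axis=0), np.max(dataset, axis=0)
    normalised = (dataset - minimum) / (maximum - minimum)  # assumes maximum > minimum per feature
    return normalised, (minimum, maximum)

# Two features on very different scales (made-up values)
features = np.array([[0.1, 1000.0], [0.5, 3000.0], [0.9, 2000.0]])
normalised, (minimum, maximum) = min_max_normalise(features)
print(normalised)
```

After scaling, both columns span the same range, so neither feature dominates simply because of its units.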

Feature normalisation puts the gradients of each different model parameter in the same order of magnitude. This converts loss surfaces that appear as *valleys* into loss surfaces that appear as *bowls*. Feature normalisation promotes optimisation for all model parameters using the same learning rate.

![](images/bowl.png)

> Always normalise or standardise your input data.

## Normalisation Issues

### Data leakage

The statistics computed from the unsplit dataset will contain information about every example.

Normalising dataset splits based on such statistics will leak information about the test and validation sets.

> Always split before normalising your data. 

### Training-testing skew

This occurs when the training data appear different from the testing data. This can be a result of normalising the validation or testing sets using their own mean and standard deviation, rather than the same ones employed to normalise the training data. If the other sets have different statistics, your model may receive inputs that appear rather different from those on which it was trained to make predictions.

> All data sets should be normalised using the same statistics.

Normalise your validation and test sets using the mean and standard deviation from your training set.

## Example

Below, we provide a function for standardising data. Notice how it computes the statistics if they are not provided (as it should for the training set), but will otherwise allow you to pass them in (as you should for the other sets).

In [None]:
def standardize_data(dataset, mean=None, std=None):
    if mean is None and std is None:
        mean, std = np.mean(dataset, axis=0), np.std(
            dataset, axis=0
        )  ## get the mean and standard deviation of the dataset
    standardized_dataset = (dataset - mean) / std
    return standardized_dataset, (mean, std)

X_train, (mean, std) = standardize_data(X_train)
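
The sketch below illustrates reusing the training statistics for another split. It repeats the function above so that it runs on its own, and it uses synthetic arrays (`train` and `validation`, made up for this example) rather than the notebook's data.

```python
import numpy as np

def standardize_data(dataset, mean=None, std=None):
    """Standardise a dataset, computing the statistics only if they are not provided."""
    if mean is None and std is None:
        mean, std = np.mean(dataset, axis=0), np.std(dataset, axis=0)
    standardized_dataset = (dataset - mean) / std
    return standardized_dataset, (mean, std)

rng = np.random.default_rng(0)
train = rng.normal(loc=50.0, scale=10.0, size=(200, 2))      # synthetic training split
validation = rng.normal(loc=50.0, scale=10.0, size=(50, 2))  # synthetic validation split

# Compute the statistics on the training split only...
train_standardized, (mean, std) = standardize_data(train)
# ...and reuse them for the validation split to avoid leakage and skew
validation_standardized, _ = standardize_data(validation, mean, std)

print("Training mean after standardising:", train_standardized.mean(axis=0))
print("Validation mean after standardising:", validation_standardized.mean(axis=0))
```

The training split ends up with zero mean and unit standard deviation, whereas the validation split is only approximately so, which is expected because it was scaled with the training statistics.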

## Conclusion
At this point, you should have a good understanding of

- the gradient-based optimisation and learning rate concepts.
- how to implement the stochastic gradient descent algorithm from scratch in Python.
- how to implement linear regression from scratch in Python.