# Gradient descent by batch vs. mini-batch vs. stochastic
M5U5 - Exercise 2

## What are we going to do?
- Adapt our batch gradient descent implementation to mini-batch and stochastic gradient descent
- Compare how training behaves under the 3 methods

Remember to follow the instructions for the practice deliverables given in [Submission instructions](https://github.com/Tokio-School/Machine-Learning-EN/blob/main/Submission_instructions.md).

## Instructions

So far we have trained models and optimised functions by gradient descent without mentioning that the variant we used is "batch" gradient descent, as distinct from mini-batch and stochastic gradient descent.

For a detailed comparison of the 3 types you can refer to the course content. Just a reminder:
- "Batch": the set of training examples used to compute one update of $\Theta$.
- "Epoch": one complete pass over the training data, during which the weights of $\Theta$ are updated one or more times, depending on the method.
- Batch: one update of $\Theta$ per epoch, computed over all the training data.
    - Slow but steady, eventually converges.
- Stochastic: one update of $\Theta$ per training example.
    - Fast at start but very unstable, takes much longer to converge. Cannot be parallelised.
    - "Stochastic" as it is much more random in its path.
- Mini-batch: one update of $\Theta$ per partition of the training data, e.g. 10% of the data or 10 partitions.
    - Best of both worlds: faster than batch, more stable than stochastic, converges and can be parallelised.

We will implement all 3 types, first with our own Numpy implementation and then with Scikit-learn, and compare their characteristics, in this case for **linear regression**:

In [None]:
# TODO: Import all necessary libraries into this cell

## Synthetic dataset generation and data processing

Reuse your cells from previous exercises to create a synthetic dataset for linear regression, with Numpy or Scikit-learn methods:
- Create a dataset with no error term
- Shuffle the data randomly
- Normalise the data if necessary
- Split the dataset into training and test subsets. We will not do validation or regularisation in this exercise.
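As a reminder of what these steps can look like, here is a minimal Numpy sketch. The sizes `m` and `n`, the seed, the value ranges and the 80/20 split are illustrative assumptions, not values required by the exercise. Because the features are drawn from $[-1, 1]$, no normalisation is applied in this sketch.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed (assumption) for reproducibility

m, n = 200, 3  # hypothetical number of examples and features
X = rng.uniform(-1.0, 1.0, size=(m, n))
X = np.insert(X, 0, 1.0, axis=1)  # prepend the bias column x_0 = 1

theta_true = rng.uniform(-5.0, 5.0, size=n + 1)
y = X @ theta_true  # no error term: y is an exact linear function of X

# Shuffle examples and targets with the same permutation
perm = rng.permutation(m)
X, y = X[perm], y[perm]

# Split into training and test subsets (80/20 is an arbitrary choice)
split = int(0.8 * m)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

print(X_train.shape, X_test.shape)
```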

In [None]:
# TODO: Create a synthetic dataset for linear regression with no error term

In [None]:
# TODO: Rearrange the data randomly

In [None]:
# TODO: Normalise the data if needed

In [None]:
# TODO: Divide the dataset into training and test subsets

## Customised gradient descent

### Batch gradient descent

Recall the cost function and update equations for regularised batch gradient descent:

$$ h_\theta(x^i) = Y = X \times \Theta^T $$
$$ J_\theta = \frac{1}{2m} [\sum\limits_{i=0}^{m} (h_\theta(x^i)-y^i)^2 + \lambda \sum\limits_{j=1}^{n} \theta^2_j] $$
$$ \theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=0}^{m}(h_\theta (x^i) - y^i) x_0^i $$
$$ \theta_j := \theta_j (1 - \alpha \frac{\lambda}{m}) - \alpha \frac{1}{m} \sum_{i=0}^{m}(h_\theta (x^i) - y^i) x_j^i; \space j \in [1, n] $$
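For reference, the equations above can be written in vectorised Numpy roughly as follows. This is only a sketch, not necessarily the implementation from previous exercises: the function names, the default hyper-parameters and the choice to record the cost before each update are assumptions. `X` is assumed to include the bias column $x_0 = 1$.

```python
import numpy as np

def regularized_cost(X, y, theta, lambda_=0.0):
    """Regularised cost J over all examples; theta[0] is not penalised."""
    m = X.shape[0]
    residuals = X @ theta - y
    penalty = lambda_ * np.sum(theta[1:] ** 2)
    return (residuals @ residuals + penalty) / (2 * m)

def batch_gradient_descent(X, y, theta, alpha=0.1, lambda_=0.0, n_epochs=500):
    """One update of theta per epoch, computed over the whole training set."""
    m = X.shape[0]
    theta = np.array(theta, dtype=float)  # work on a float copy
    j_hist = []
    for _ in range(n_epochs):
        j_hist.append(regularized_cost(X, y, theta, lambda_))
        grad = X.T @ (X @ theta - y) / m
        grad[1:] += (lambda_ / m) * theta[1:]  # regularisation for j >= 1 only
        theta -= alpha * grad
    return j_hist, theta
```

Subtracting `alpha * (lambda_ / m) * theta[1:]` is algebraically the same as multiplying $\theta_j$ by $(1 - \alpha \frac{\lambda}{m})$, as in the last equation.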

We are going to retrieve the batch gradient descent implementation you have used in previous exercises and take it as the basis for mini-batch and stochastic gradient descent.

Start by retrieving the implementation cells of the cost function, its implementation check, the regularised gradient descent, the training of a model and its implementation check.

Once retrieved, execute the cells, adding the suffix `_batch` to the variables of the cost function evolution and final $\Theta$:

In [None]:
# TODO: Retrieve the cell that implements the cost function

In [None]:
# TODO: Retrieve the cell that checks the implementation of the cost function

In [None]:
# TODO: Retrieve the cell that implements the gradient descent function

In [None]:
# TODO: Retrieve the cell that trains a model with a training dataset and given hyper-parameters

In [None]:
# TODO: Retrieve the cell that tests the implementation of the gradient descent function

In [None]:
# TODO: Retrieve the cell that graphically represents the evolution of the cost function history vs. iterations

### Stochastic gradient descent

In the stochastic gradient descent, we update the values of $\Theta$ after each example, ending an epoch when we complete one pass through all examples.

Therefore, the training algorithm will be:
1. Reorder the examples randomly (we have already reordered them).
1. Initialise $\Theta$ to random values.
1. For each epoch, up to a maximum number of iterations:
    1. For each training example:
        1. Compute the prediction or hypothesis $h_\Theta(x^i)$
        1. Compute the cost, loss or error of that prediction
        1. Compute the gradients of the coefficients $\Theta$
        1. Update the coefficients $\Theta$

Therefore, the regularised stochastic gradient descent and cost function equations are:

$$ h_\theta(x^i) = y^i = x^i \times \Theta^T $$
$$ J_\theta(x^i) = \frac{1}{2m} [(h_\theta(x^i) - y^i)^2 + \lambda \sum\limits_{j=1}^{n} \theta^2_j] $$
$$ \theta_0 := \theta_0 - \alpha \frac{1}{m} (h_\theta (x^i) - y^i) x_0^i $$
$$ \theta_j := \theta_j (1 - \alpha \frac{\lambda}{m}) - \alpha \frac{1}{m} (h_\theta (x^i) - y^i) x_j^i; \space j \in [1, n] $$
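A possible shape for the stochastic version is sketched below. It follows the equations above literally, including the $\frac{1}{m}$ scaling of each per-example step. The function name, defaults and the decision to store one cost value per update are assumptions, and `X` is again assumed to include the bias column.

```python
import numpy as np

def stochastic_gradient_descent(X, y, theta, alpha=0.1, lambda_=0.0, n_epochs=50):
    """One update of theta per training example; an epoch is one full pass."""
    m = X.shape[0]
    theta = np.array(theta, dtype=float)  # work on a float copy
    j_hist = []
    for _ in range(n_epochs):
        for i in range(m):
            error = X[i] @ theta - y[i]  # h(x^i) - y^i for a single example
            cost = (error ** 2 + lambda_ * np.sum(theta[1:] ** 2)) / (2 * m)
            j_hist.append(cost)
            grad = error * X[i] / m
            grad[1:] += (lambda_ / m) * theta[1:]  # theta_0 is not regularised
            theta -= alpha * grad
    return j_hist, theta
```

Compared with the batch version, the only structural change is the inner loop over examples: the sum over $i$ disappears and $\Theta$ is updated inside that loop.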

Now adapt your model training cell for stochastic gradient descent and train a model on the training data:

*NOTE:* Try to use the same hyper-parameters and initial Theta for all models, so that you can compare them a posteriori under the same circumstances.

In [None]:
# TODO: Adapt the regularised gradient descent function to stochastic
# NOTE: Check the implementation first before modifying it. Many changes may not be necessary...

In [None]:
# TODO: Train a stochastic gradient descent model
# Add the suffix "_stochastic" to the outcome variables to distinguish it from other models

In [None]:
# TODO: Test the implementation of stochastic gradient descent under various circumstances

In [None]:
# TODO: Plot the evolution of the cost function graphically

### Mini-batch gradient descent

In gradient descent with mini-batches, we update the values of $\Theta$ after each subset of examples or "batch", a partition of the training subset, ending an epoch when we complete one pass through all the "batches" or examples.

Therefore, the training algorithm will be:
1. Reorder the examples randomly (we have already reordered them).
1. Initialise $\Theta$ to random values.
1. For each epoch, up to a maximum number of iterations:
    1. Divide the training examples into *k* "batches".
    1. For each "batch":
        1. Compute the prediction or hypothesis $h_\Theta(x^i)$ over the entire "batch"
        1. Compute the cost, loss or error of the prediction over it.
        1. Compute the gradients of the coefficients $\Theta$
        1. Update the coefficients $\Theta$

Therefore, the equations of the cost function and gradient descent with regularised mini-batches are:

$$ m_k = \text{number of examples in the current "batch"} $$
$$ h_\theta(x^i) = Y = X \times \Theta^T $$
$$ J_\theta = \frac{1}{2 m_k} [\sum\limits_{i=0}^{m_k} (h_\theta(x^i)-y^i)^2 + \lambda \sum\limits_{j=1}^{n} \theta^2_j] $$
$$ \theta_0 := \theta_0 - \alpha \frac{1}{m_k} \sum_{i=0}^{m_k}(h_\theta (x^i) - y^i) x_0^i $$
$$ \theta_j := \theta_j (1 - \alpha \frac{\lambda}{m_k}) - \alpha \frac{1}{m_k} \sum_{i=0}^{m_k}(h_\theta (x^i) - y^i) x_j^i; \space j \in [1, n] $$
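One way to sketch the mini-batch variant in Numpy is shown below. The function name, the defaults and the use of `np.array_split` to build the partitions are assumptions. The sketch also assumes `n_batches` is no larger than the number of examples, so that no partition is empty, and it splits the data once because the partitions do not change between epochs.

```python
import numpy as np

def mini_batch_gradient_descent(X, y, theta, alpha=0.1, lambda_=0.0,
                                n_epochs=100, n_batches=10):
    """One update of theta per mini-batch; an epoch visits every mini-batch once."""
    theta = np.array(theta, dtype=float)  # work on a float copy
    j_hist = []
    # array_split also works when the number of examples is not a multiple of n_batches
    X_batches = np.array_split(X, n_batches)
    y_batches = np.array_split(y, n_batches)
    for _ in range(n_epochs):
        for X_b, y_b in zip(X_batches, y_batches):
            m_k = X_b.shape[0]  # number of examples in the current batch
            residuals = X_b @ theta - y_b
            cost = (residuals @ residuals + lambda_ * np.sum(theta[1:] ** 2)) / (2 * m_k)
            j_hist.append(cost)
            grad = X_b.T @ residuals / m_k
            grad[1:] += (lambda_ / m_k) * theta[1:]  # theta_0 is not regularised
            theta -= alpha * grad
    return j_hist, theta
```

With `n_batches=1` this reduces to batch gradient descent, which is a convenient sanity check.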

Now adapt your model training cell for mini-batch gradient descent and train a model on the training data:

*NOTE:* Try to use the same hyper-parameters and initial Theta for all models, so that you can compare them a posteriori under the same circumstances.

In [None]:
# TODO: Adapt the regularised gradient descent function to mini-batch
# NOTE: Check the implementation first before modifying it. You may not need to make many changes...

In [None]:
# TODO: Train a model by gradient descent with mini-batches
# Add the suffix "_mini_batch" to result variables to distinguish it from other models

In [None]:
# TODO: Test the implementation of gradient descent with mini-batches in various circumstances

In [None]:
# TODO: Plot the evolution of the cost function graphically

## Comparison of methods

Answer the following questions in the next cell:
*QUESTIONS:*
1. *How much did you need to modify the gradient descent functions?*
1. *Which model had the lowest final cost?*
1. *Which model took the least time to train/converge?*
1. *How did the cost functions evolve? Were they comparable in terms of stability, for example?*

*ANSWERS:*
1. ...
1. ...
1. ...
1. ...

### Comparison of residuals and accuracy

Compute the accuracy as RMSE and plot the residuals of the 3 models:
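As a reminder, the RMSE is the square root of the mean squared residual. A minimal helper could look like this; the name `rmse` and its signature are illustrative assumptions, and `X` is assumed to include the bias column:

```python
import numpy as np

def rmse(X, y, theta):
    """Root mean squared error of the linear model X @ theta against y."""
    residuals = X @ theta - y
    return np.sqrt(np.mean(residuals ** 2))
```

Calling it on the test subset with each of the three final $\Theta$ vectors gives three directly comparable numbers.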

In [None]:
# TODO: Compute the RMSE of the 3 models

In [None]:
# TODO: Plot the residuals of the 3 models
# Use a scatter plot with 3 series in different colours and a legend
# Include a grid

*QUESTION:* Do you notice any differences between them?

## Gradient descent with Scikit-learn

Now train 3 models and compare their performance using Scikit-learn methods, specifically linear regression with [linear_model.SGDRegressor](https://scikit-learn.org/0.15/modules/generated/sklearn.linear_model.SGDRegressor.html) and its `fit()` and [partial_fit()](https://scikit-learn.org/0.15/modules/generated/sklearn.linear_model.SGDRegressor.html#sklearn.linear_model.SGDRegressor.partial_fit) methods:
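As a hedged starting point, the sketch below feeds an `SGDRegressor` one chunk of data at a time with `partial_fit()`. The dataset, the constant learning rate `eta0=0.01`, `alpha=0.0` (no regularisation) and the numbers of epochs and chunks are all illustrative assumptions. Note that `SGDRegressor` fits its own intercept, so no bias column is added here, and that it updates the weights example by example internally, so the chunks control which data each call sees rather than averaging a gradient over the chunk.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 3))  # no bias column: the model fits an intercept
y = X @ np.array([2.0, -1.0, 0.5]) + 3.0   # synthetic target with no error term

model = SGDRegressor(learning_rate="constant", eta0=0.01, alpha=0.0, random_state=0)

n_epochs, n_batches = 50, 10  # hypothetical values
for _ in range(n_epochs):
    for X_b, y_b in zip(np.array_split(X, n_batches), np.array_split(y, n_batches)):
        model.partial_fit(X_b, y_b)  # one pass over this chunk only

print(model.coef_, model.intercept_)
```

Calling `fit()` on the full training set instead runs several epochs in a single call, which is the simplest option when you do not need to control the partitions yourself.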

In [None]:
# TODO: Train a model by batch gradient descent with Scikit-learn
# Add the suffix "_batch" to the result variables to distinguish it from other models
# Show its training time
# Compute its final cost and RMSE
# Plot its residuals

In [None]:
# TODO: Train a model by stochastic gradient descent with Scikit-learn
# Add the suffix "_stochastic" to the result variables to distinguish it from other models
# Show its training time
# Compute its final cost and RMSE
# Plot its residuals

In [None]:
# TODO: Train a model by gradient descent with mini-batches with Scikit-learn
# Add the suffix "_mini_batch" to the result variables to distinguish it from other models
# Show its training time
# Compute its final cost and RMSE
# Plot its residuals

*QUESTION:* Do you notice any differences between them?