# Linear Regression: Regularisation
M2U3 - Exercise 2

## What are we going to do?
- We will implement a regularised cost function for multivariate linear regression
- We will implement the regularisation for gradient descent

Remember to follow the instructions for the submission of assignments indicated in [Submission Instructions](https://github.com/Tokio-School/Machine-Learning-EN/blob/main/Submission_instructions.md).

In [None]:
import time
import numpy as np

from matplotlib import pyplot as plt

## Creation of a synthetic dataset

To test your implementation of a regularised gradient descent and cost function, retrieve your cells from the previous notebooks on synthetic datasets and generate a dataset for this exercise.

Don't forget to add a bias term to *X* and an error term to *Y*, initialized to 0 for now.
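
As a reference point, here is a minimal sketch of one way such a dataset could be generated. It is not the course's own solution: the seed, the value ranges, the multiplicative form of the error term, and the convention that `n` counts the bias column are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, only for reproducibility

m, n = 1000, 3  # assumption: n counts the bias column too

# Random features in [-1, 1), with a column of ones inserted as the bias term
X = rng.uniform(-1.0, 1.0, size=(m, n - 1))
X = np.insert(X, 0, 1.0, axis=1)

Theta_true = rng.uniform(-10.0, 10.0, size=n)

error = 0.0  # assumed to be a relative noise level; 0 keeps Y exactly linear
Y = X @ Theta_true
Y = Y + Y * rng.uniform(-error, error, size=m)

print(X.shape, Y.shape, Theta_true.shape)
```

With `error = 0.`, `Y` is an exact linear function of `X`, which is what lets the cost at *Theta_true* be exactly 0 later on.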

In [None]:
# TODO: Manually generate a synthetic dataset, with a bias term and an error term initialised to 0

m = 1000
n = 3

X = [...]

Theta_true = [...]

error = 0.

Y = [...]

# Check the values and dimensions of the vectors
print('Theta to be estimated and its dimensions:')
print()
print()

print('First 10 rows and 5 columns of X and Y:')
print()
print()

print('Dimensions of X and Y:')
print()

## Regularised cost function

We will now modify our implementation of the cost function from the previous exercise to add the regularisation term.

Recall that the regularised cost function is:

$$ h_\theta(x^i) = Y = X \times \Theta^T $$
$$J_\theta = \frac{1}{2m} [\sum\limits_{i=0}^{m} (h_\theta(x^i)-y^i)^2 + \lambda \sum\limits_{j=1}^{n} \theta^2_j]$$
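
To make the formula concrete, the sketch below evaluates it on a tiny hand-checkable example. The function name `ridge_cost_sketch` and the demo arrays are hypothetical, and 1D arrays are assumed for `theta` and `y`; it is one possible reading of the formula, not the template's expected solution.

```python
import numpy as np

def ridge_cost_sketch(x, y, theta, lambda_=0.0):
    """Regularised squared-error cost; theta[0] is treated as the bias and not penalised."""
    m = x.shape[0]
    residuals = x @ theta - y                     # shape (m,)
    penalty = lambda_ * np.sum(theta[1:] ** 2)    # skips the bias coefficient
    return float((residuals @ residuals + penalty) / (2 * m))

# 2 examples, bias column + 1 feature
x_demo = np.array([[1.0, 1.0], [1.0, 2.0]])
theta_demo = np.array([1.0, 2.0])
y_demo = x_demo @ theta_demo                      # exact targets, so the residuals are 0

print(ridge_cost_sketch(x_demo, y_demo, theta_demo))               # 0.0
print(ridge_cost_sketch(x_demo, y_demo, theta_demo, lambda_=1.0))  # 1.0
```

In the second call the residuals are still 0, so the whole cost comes from the penalty: $\lambda \theta_1^2 / (2m) = 1 \cdot 2^2 / 4 = 1$.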

In [None]:
# TODO: Implement the regularised cost function according to the following template

def regularized_cost_function(x, y, theta, lambda_=0.):
    """ Computes the cost function for the considered dataset and coefficients.
    
    Positional arguments:
    x -- Numpy 2D array with the values of the independent variables from the examples, of size m x n
    y -- Numpy 1D array with the dependent/target variable, of size m x 1
    theta -- Numpy 1D array with the weights of the model coefficients, of size 1 x n (row vector)
    
    Named arguments:
    lambda_ -- float with the regularisation parameter
    
    Return:
    j -- float with the cost for this theta array
    """
    m = [...]
    
    # Remember to check the dimensions of the matrix multiplication to perform it correctly
    # Remember not to regularize the coefficient of the bias parameter (first value of theta)
    j = [...]
    
    return j

*NOTE:* Check that the function returns a plain float value, not an array or matrix. Use the `ndarray.resize((size0, size1))` method if you need to change the dimensions of an array before multiplying it with `np.matmul()`, and make sure the resulting dimensions match, or return `j[0,0]` as the `float` value.
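
The shape issue described in the note can be seen in a small sketch. It uses `ndarray.reshape`, which returns a reshaped array instead of modifying it in place as `resize` does; that substitution and the demo values are assumptions for illustration.

```python
import numpy as np

x = np.array([[1.0, 2.0], [1.0, 3.0]])       # (m, n) = (2, 2)
theta = np.array([0.5, 1.0])                  # 1D, shape (n,)
y = np.array([2.0, 4.0])                      # 1D, shape (m,)

theta_col = theta.reshape((-1, 1))            # (n, 1) column vector
y_col = y.reshape((-1, 1))                    # (m, 1) column vector

residuals = np.matmul(x, theta_col) - y_col   # (m, 1)
sq_sum = np.matmul(residuals.T, residuals)    # (1, 1) array, not a float

print(sq_sum.shape)                           # (1, 1)
j = float(sq_sum[0, 0])                       # plain Python float
print(type(j).__name__, j)                    # float 0.5
```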

As the synthetic dataset has the error term set at 0, the result of the cost function for the *Theta_true* with parameter *lambda* = 0 must be exactly 0.

As before, the cost should increase as *theta* moves away from *Theta_true*. Similarly, the higher the *lambda* regularisation parameter, the higher the penalty and the cost; and the larger the values of *Theta*, the higher the penalty and the cost as well.

Check your implementation in these 5 scenarios:
1. Using *Theta_true* and with *lambda* at 0, the cost should still be 0.
1. With *lambda* still at 0, as the value of *theta* moves away from *Theta_true*, the cost should increase.
1. Using *Theta_true* and with a *lambda* other than 0, the cost must now be greater than 0.
1. With a *lambda* other than 0, for a *theta* other than *Theta_true*, the cost must be higher than with *lambda* equal to 0.
1. With a *lambda* other than 0, the higher the values of the coefficients of *theta* (positive or negative), the higher the penalty and the higher the cost.

Recall that the value of *lambda* must never be negative and is generally less than 1: `[0, 1e-1, 3e-1, 1e-2, 3e-2, ...]`
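
The scenarios above could be checked with a sweep along these lines. This is a self-contained sketch: `cost_sketch`, the small random dataset, and the offset of 0.5 are hypothetical stand-ins for your own function and data.

```python
import numpy as np

def cost_sketch(x, y, theta, lambda_=0.0):
    """Regularised cost with an unpenalised bias coefficient theta[0]."""
    m = x.shape[0]
    r = x @ theta - y
    return float((r @ r + lambda_ * np.sum(theta[1:] ** 2)) / (2 * m))

rng = np.random.default_rng(0)
x = np.insert(rng.uniform(-1, 1, size=(100, 2)), 0, 1.0, axis=1)
theta_true = np.array([1.0, -2.0, 3.0])
y = x @ theta_true                      # error term at 0

theta_off = theta_true + 0.5            # a theta away from theta_true

lambdas = [0.0, 1e-2, 3e-2, 1e-1, 3e-1]
costs_true = [cost_sketch(x, y, theta_true, l) for l in lambdas]
costs_off = [cost_sketch(x, y, theta_off, l) for l in lambdas]

for l, c_true, c_off in zip(lambdas, costs_true, costs_off):
    print(l, c_true, c_off)
```

The first column of costs should start at 0 and grow with *lambda*, and the second should stay above the first on every row.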

In [None]:
# TODO: Check the implementation of your regularised cost function in these scenarios

theta = Theta_true    # Modify and test various values of theta

j = regularized_cost_function(X, Y, theta)

print('Cost of the model:')
print(j)
print('Tested Theta and actual Theta:')
print(theta)
print(Theta_true)

Record your experiments and results in this cell (in Markdown or code):
1. Experiment 1
1. Experiment 2
1. Experiment 3
1. Experiment 4
1. Experiment 5

## Regularised gradient descent

Now we will also regularise the training by gradient descent. We will modify the *Theta* updates so that they now also contain the *lambda* regularisation parameter:

$$ \theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=0}^{m}(h_\theta (x^i) - y^i) x_0^i $$
$$ \theta_j := \theta_j - \alpha [\frac{1}{m} \sum_{i=0}^{m}(h_\theta (x^i) - y^i) x_j^i + \frac{\lambda}{m} \theta_j]; \space j \in [1, n] $$
$$ \theta_j := \theta_j (1 - \alpha \frac{\lambda}{m}) - \alpha \frac{1}{m} \sum_{i=0}^{m}(h_\theta (x^i) - y^i) x_j^i; \space j \in [1, n] $$
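
The last two expressions are the same update written in two ways. A quick numerical sketch of that equivalence follows, using vectorised NumPy and arbitrary random values chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 50, 3
x = np.insert(rng.uniform(-1, 1, size=(m, n - 1)), 0, 1.0, axis=1)
y = rng.uniform(-1, 1, size=m)
theta = rng.uniform(-1, 1, size=n)
alpha, lambda_ = 0.1, 0.3

grad = x.T @ (x @ theta - y) / m           # unregularised gradient, shape (n,)

# Form 1: add (lambda / m) * theta_j to the gradient for j >= 1
reg = (lambda_ / m) * theta
reg[0] = 0.0                               # the bias theta_0 is not regularised
theta_form1 = theta - alpha * (grad + reg)

# Form 2: shrink theta_j by (1 - alpha * lambda / m), then take the plain step
shrink = np.full(n, 1.0 - alpha * lambda_ / m)
shrink[0] = 1.0
theta_form2 = theta * shrink - alpha * grad

print(np.allclose(theta_form1, theta_form2))   # True
```

The second form shows why this kind of regularisation is often described as weight decay: every step first scales each non-bias coefficient by a factor slightly below 1.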

Remember to build again on your previous implementation of the gradient descent function.

In [None]:
# TODO: Implement the function that trains the regularised gradient descent model

def regularized_gradient_descent(x, y, theta, alpha, lambda_=0., e, iter_):
    """ Trains the model by optimising its cost function using gradient descent
    
    Positional arguments:
    x -- Numpy 2D array with the values of the independent variables from the examples, of size m x n
    y -- Numpy 1D array with the dependent/target variable, of size m x 1
    theta -- Numpy 1D array with the weights of the model coefficients, of size 1 x n (row vector)
    alpha -- float, learning rate
    
    Named arguments (keyword):
    lambda_ -- float with the regularisation parameter
    e -- float, minimum difference between iterations to declare that the training has finally converged
    iter_ -- int/float, nº of iterations
    
    Return:
    j_hist -- list/array with the evolution of the cost function during the training
    theta -- Numpy array with the value of theta at the last iteration
    """
    # TODO: enter default values for e and iter_ in the function's keyword arguments
    
    iter_ = int(iter_)    # If you have entered iter_ in scientific notation (1e3) or float (1000.), convert it
    
    # Initialise j_hist as a list or a Numpy array. Remember that we do not know what size it will eventually be
    j_hist = [...]
    
    m, n = [...]    # Obtain m and n from the dimensions of X
    
    for k in [...]:    # Iterate over the maximum nº of iterations
        # Declare a theta for each iteration as a "deep copy" of theta, since we must update it value by value
        theta_iter = [...]
        
        for j in [...]:    # Iterate over the nº of features
            # Update theta_iter for each feature, according to the derivative of the cost function
            # Include the learning rate alpha
            # Careful with the matrix multiplication, its order and dimensions
            
            if j > 0:
                # Regularise all coefficients except for the bias parameter (first coef.)
                pass
            
            theta_iter[j] = theta[j] - [...]
            
        theta = theta_iter
        
        cost = regularized_cost_function([...])    # Calculate the regularised cost for the current theta iteration
        
        j_hist[...]    # Add the cost of the current iteration to the cost history
        
        # Check if the difference between the cost of the current iteration and that of the last iteration in absolute value
        # is less than the minimum difference to declare convergence, e
        if k > 0 and [...]:
            print('Converged at iteration nº:', k)
            
            break
    else:
        print('Max nº of iterations reached')
        
    return j_hist, theta

*Note*: Remember that the code templates are only an aid. Sometimes, you may want to use different code with the same functionality, e.g., iterate over elements in a different way, etc. Feel free to modify them as you wish!
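
As one example of such a variation, the sketch below replaces the per-feature loop with a vectorised update. The name `gd_sketch`, the default values for `e` and `iter_`, and the demo data are assumptions, not the template's expected answer.

```python
import numpy as np

def gd_sketch(x, y, theta, alpha, lambda_=0.0, e=1e-6, iter_=1e4):
    """Vectorised regularised gradient descent; theta[0] is the unpenalised bias."""
    m = x.shape[0]
    theta = np.array(theta, dtype=float)    # copy, so the caller's array is untouched
    j_hist = []
    for k in range(int(iter_)):
        reg = (lambda_ / m) * theta
        reg[0] = 0.0
        theta = theta - alpha * (x.T @ (x @ theta - y) / m + reg)

        r = x @ theta - y
        j_hist.append(float((r @ r + lambda_ * np.sum(theta[1:] ** 2)) / (2 * m)))
        # Stop once the cost changes by less than e between iterations
        if k > 0 and abs(j_hist[-2] - j_hist[-1]) < e:
            break
    return j_hist, theta

rng = np.random.default_rng(3)
x_demo = np.insert(rng.uniform(-1, 1, size=(200, 2)), 0, 1.0, axis=1)
theta_demo_true = np.array([2.0, -1.0, 0.5])
y_demo = x_demo @ theta_demo_true           # error term at 0

j_hist_demo, theta_demo = gd_sketch(x_demo, y_demo, np.zeros(3), alpha=0.1, e=1e-9)
print(len(j_hist_demo), j_hist_demo[-1], theta_demo)
```

Because the whole `theta` vector is updated in a single expression from the previous `theta`, no separate deep copy per iteration is needed here.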

## Checking the regularised gradient descent

To check your implementation again, check with *lambda* at 0 using various values of *theta_ini*, both with the *Theta_true* and values further and further away from it, and check that eventually the model converges to the *Theta_true*:

In [None]:
# TODO: Test your implementation by training a model on the previously created synthetic dataset

# Create an initial theta with a given, random, or hand-picked value
theta_ini = [...]

print('Initial theta:')
print(theta_ini)

alpha = 1e-1
lambda_ = 0.
e = 1e-3
iter_ = 1e3    # Check that your function accepts float values, or modify it

print('Hyperparameters used:')
print('Alpha:', alpha, 'Max. error:', e, 'Nº iter:', iter_)

t = time.time()
j_hist, theta_final = regularized_gradient_descent([...])

print('Training time (s):', time.time() - t)

# TODO: complete
print('\nLast 10 values of the cost function:')
print(j_hist[...])
print('\nFinal cost:')
print(j_hist[...])
print('\nFinal theta:')
print(theta_final)

print('True values of Theta and difference from the trained values:')
print(Theta_true)
print(theta_final - Theta_true)

Now check the training of a model again under some of the previous circumstances:
1. Using a random *theta_ini* and with *lambda* at 0, the final cost should still be close to 0 and the final *theta* close to *Theta_true*.
1. Using a random *theta_ini* and with a small, non-zero *lambda*, the final cost should be close to 0, although the model may start to lose accuracy.
1. As the value of *lambda* increases, the model will lose more accuracy.

To do so, remember that you can modify the values in the cells and re-run them.

Record your experiments and results in this cell (in Markdown or code):
1. Experiment 1
1. Experiment 2
1. Experiment 3
1. Experiment 4
1. Experiment 5

## Why did we need regularisation?

The goal of regularisation is to penalise the model when it overfits, that is, when it starts to memorise results rather than learning to generalise.

This becomes a problem when the training data and the data on which we must make predictions in production follow significantly different distributions.

To check our regularised gradient descent training, go back to the dataset generation section and generate a dataset with a much lower ratio of examples to features and a much higher error ratio.

Start experimenting with these values, and then vary the model's *lambda* to see whether a *lambda* other than 0 starts to yield better accuracy than *lambda* = 0.
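
A possible shape for this experiment is sketched below. Everything in it is an assumption made for illustration: the helper names, the additive Gaussian noise (instead of a relative error term), the 20 training examples for 10 coefficients, and the separate test set used to measure accuracy.

```python
import numpy as np

def train_sketch(x, y, alpha, lambda_, n_iter=20000):
    """Regularised gradient descent from theta = 0; theta[0] is the unpenalised bias."""
    m, n = x.shape
    theta = np.zeros(n)
    for _ in range(n_iter):
        reg = (lambda_ / m) * theta
        reg[0] = 0.0
        theta = theta - alpha * (x.T @ (x @ theta - y) / m + reg)
    return theta

def half_mse(x, y, theta):
    """Unregularised cost, used here only to measure the fit."""
    r = x @ theta - y
    return float(r @ r / (2 * x.shape[0]))

rng = np.random.default_rng(7)
m_train, m_test, n = 20, 1000, 10          # few examples per feature
theta_true = rng.uniform(-1, 1, size=n)

def make_split(m):
    x = np.insert(rng.uniform(-1, 1, size=(m, n - 1)), 0, 1.0, axis=1)
    y = x @ theta_true + rng.normal(0.0, 1.0, size=m)   # strong additive noise
    return x, y

x_train, y_train = make_split(m_train)
x_test, y_test = make_split(m_test)

results = {}
for lambda_ in [0.0, 1e-1, 3e-1, 1.0, 3.0]:
    theta = train_sketch(x_train, y_train, alpha=0.1, lambda_=lambda_)
    results[lambda_] = (
        half_mse(x_train, y_train, theta),   # fit on the training data
        half_mse(x_test, y_test, theta),     # fit on unseen data
        float(np.linalg.norm(theta[1:])),    # size of the penalised coefficients
    )
    print(lambda_, results[lambda_])
```

Whether a non-zero *lambda* actually gives a lower test cost depends on the noise level, the dataset size, and the seed, so treat the printed numbers as something to explore rather than a fixed result. What should hold consistently is that larger values of *lambda* shrink the non-bias coefficients.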