# Multivariate Linear Regression: Gradient descent
M2U2 - Exercise 2

## What are we going to do?
- Implement the optimization of the cost function using gradient descent, or in other words, training the model

Remember to follow the instructions for the submission of assignments indicated in [Submission Instructions](https://github.com/Tokio-School/Machine-Learning-EN/blob/main/Submission_instructions.md).

## Instructions

This exercise is a continuation of the previous exercise "Cost function", so you should build on it.

In [None]:
import time
import numpy as np

from matplotlib import pyplot as plt

## Task 1: Implement the cost function for multivariate linear regression

In this task, copy the corresponding cell from the previous exercise, reusing your implementation of the vectorised cost function:

In [None]:
# TODO: Implement the vectorised cost function following the template below

def cost_function(x, y, theta):
    """ Compute the cost function for the considered dataset and coefficients.
    
    Positional arguments:
    x -- Numpy 2D array with the values of the independent variables from the examples, of size m x n
    y -- Numpy 1D array with the dependent/target variable, of size m x 1
    theta -- Numpy 1D array with the weights of the model coefficients, of size 1 x n (row vector)
    
    Return:
    j -- float with the cost for this theta array
    """
    m = [...]
    
    # Remember to check the dimensions of the matrix multiplication to do it correctly
    j = [...]
    
    return j

## Task 2: Implement the optimisation of this cost function using gradient descent

We are now going to solve the optimisation of this cost function to train the model, using the vectorised gradient descent method. The model will be considered trained when its cost function has reached a minimum, stable value.

$$Y = h_\Theta(X) = X \times \Theta^T$$

$$J_\theta = \frac{1}{2m} \sum_{i = 0}^{m - 1} (h_\theta(x^i) - y^i)^2$$

$$\theta_j := \theta_j - \alpha \left[ \frac{1}{m} \sum_{i = 0}^{m - 1}{(h_\theta(x^i) - y^i) x_j^i} \right]$$
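
As a rough illustration of the update rule above, the following sketch applies a single vectorised step to a tiny hand-made dataset. The names `x`, `y`, `theta` and `alpha` mirror the template, but the values are made up for illustration, and this is only one possible way to write the step, not the expected solution.

```python
import numpy as np

# Toy dataset (assumed values, for illustration only): m = 3 examples, n = 2 features
# The first column of ones plays the role of the bias/intercept term
x = np.array([[1., 1.],
              [1., 2.],
              [1., 3.]])
y = np.array([2., 4., 6.])      # generated with theta = [0, 2]
theta = np.zeros(2)             # initial weights
alpha = 0.1                     # learning rate (assumed value)

m = x.shape[0]

residuals = np.matmul(x, theta) - y            # h_theta(x) - y, shape (m,)
gradient = np.matmul(x.T, residuals) / m       # one partial derivative per feature, shape (n,)
theta_new = theta - alpha * gradient           # all theta_j updated simultaneously

cost_before = np.sum(residuals ** 2) / (2 * m)
cost_after = np.sum((np.matmul(x, theta_new) - y) ** 2) / (2 * m)

print('theta after one step:', theta_new)
print('cost before / after:', cost_before, cost_after)
```

With a suitable learning rate, the cost after the step should be lower than before it, which is a quick sanity check for your own implementation.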

To do this, once again, fill in the code template in the next cell.

Tips:
- If you prefer, you can first implement the function with loops and iterations, and finally in a vectorised way
- Remember the dimensions of each vector/matrix
- Again, record the operations in step-by-step order on a sheet or in an auxiliary cell
- At each step, write down the dimensions of your result, which you can also check in your code
- Use numpy.matmul() for matrix multiplication
- At the start of each training iteration, copy all the $\Theta$ values: each coefficient is updated from the entire vector, so every update within an iteration must read the values from the previous iteration
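
The last tip can be illustrated with a small sketch. It compares an update that reads from an untouched theta with one that overwrites theta while still looping over the features. The toy values and the helper names `step_with_copy` and `step_in_place` are hypothetical and exist only for this comparison.

```python
import numpy as np

# Toy dataset (assumed values): m = 3 examples, n = 2 features, first column is the bias
x = np.array([[1., 1.],
              [1., 2.],
              [1., 3.]])
y = np.array([2., 4., 6.])
alpha = 0.1
m, n = x.shape


def step_with_copy(theta):
    """One step where every theta_j is computed from the same, unmodified theta."""
    theta_iter = np.copy(theta)    # independent copy: theta stays untouched in the loop
    for j in range(n):
        residuals = np.matmul(x, theta) - y
        theta_iter[j] = theta[j] - alpha * np.sum(residuals * x[:, j]) / m
    return theta_iter


def step_in_place(theta):
    """One step that overwrites theta while looping: later features see updated values."""
    theta = np.copy(theta)         # copy only to avoid mutating the caller's array
    for j in range(n):
        residuals = np.matmul(x, theta) - y    # for j = 1 this already uses the new theta[0]
        theta[j] = theta[j] - alpha * np.sum(residuals * x[:, j]) / m
    return theta


theta_0 = np.zeros(n)
print('with copy:', step_with_copy(theta_0))
print('in place: ', step_in_place(theta_0))
```

The two results differ in the second coefficient, because the in-place version computes it from a theta that has already been partially updated. For a 1D array of floats, `np.copy` is enough to obtain an independent copy.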

In [None]:
# TODO: Implement the function that trains the model using gradient descent

def gradient_descent(x, y, theta, alpha, e, iter_):
    """ Trains the model by optimising its gradient descent cost function
    
    Positional arguments:
    x -- Numpy 2D array with the values of the independent variables from the examples, of size m x n
    y -- Numpy 1D array with the dependent/target variable, of size m x 1
    theta -- Numpy 1D array with the weights of the model coefficients, of size 1 x n (row vector)
    alpha -- float, learning rate
    
    Named arguments (keyword):
    e -- float, minimum difference between iterations to declare that the training has finally converged
    iter_ -- int/float, nº of iterations
    
    Return:
    j_hist -- list/array with the evolution of the cost function during the training
    theta -- NumPy array with the value of theta at the last iteration
    """
    # TODO: add default values for e and iter_ as keyword arguments in the function signature
    
    iter_ = int(iter_)    # Converts iter_ to int if it was given in scientific notation (1E3) or as a float (1000.)
    
    # Initialise j_hist as a list or a NumPy array. Remember that we do not know what size it will eventually be
    # Its max. nº of elements will be the max. nº of iterations
    j_hist = [...]
    
    m, n = [...]    # Obtain m and n from the dimensions of X
    
    for k in [...]:    # Iterate over the maximum nº of iterations
        theta_iter = [...]    # Copy the theta for each iteration with "deep copy" since we have to update it
        
        for j in [...]:    # Iterate over the nº of features
            # Update theta_iter for each feature, according to the derivative of the cost function
            # Include the learning rate alpha
            # Careful with the matrix multiplication, its order and dimensions
            theta_iter[j] = theta[j] - [...]
            
        theta = theta_iter    # Update the entire theta, ready for the next iteration
        
        cost = cost_function([...])    # Calculate the cost for the current iteration of theta
        
        j_hist[...]    # Add the cost of the current iteration to the cost history
        
        # Check if the difference between the cost of the current iteration and that of the last iteration in absolute value
        # is less than the minimum difference to declare convergence, e, for all iterations
        # except the first
        if k > 0 and [...]:
            print('Converged at iteration nº: ', k)
            
            break
    else:
        print('Max. nº of iterations reached')
        
    return j_hist, theta
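
The template relies on Python's `for ... else` construct: the `else` branch runs only when the loop finishes without hitting `break`. The following minimal sketch shows that control flow on a made-up sequence of costs instead of a real training run. The function name `find_convergence` and the cost values are hypothetical.

```python
def find_convergence(costs, e=1e-3):
    """Return the index at which consecutive costs differ by less than e, or None."""
    for k in range(len(costs)):
        # Skip the first iteration: there is no previous cost to compare against
        if k > 0 and abs(costs[k] - costs[k - 1]) < e:
            print('Converged at iteration nº:', k)
            break
    else:
        # Reached only if the loop was never interrupted by break
        print('Max. nº of iterations reached')
        return None
    return k


# The cost stabilises in the first sequence, but keeps changing in the second
print(find_convergence([10., 4., 1.5, 1.4995, 1.4994]))
print(find_convergence([10., 8., 6., 4.]))
```

The same pattern in `gradient_descent` means that the "max. nº of iterations" message is printed only when training stops without converging.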

## Task 3: Check the implementation of gradient descent

To check your implementation, reuse the same cells several times with different parameters, plotting the evolution of the cost function each time and checking that its value decreases towards a stable minimum (close to 0 when the dataset has no error term).

In each case, check that the initial and final $\Theta$ are very similar in the following scenarios:
1. Generate several synthetic datasets, testing each of them
1. Modify the nº of examples and features, m and n
1. Modify the error parameter, which may cause the initial and final $\Theta$ to not quite match: the greater the error, the larger the difference may be
1. Vary the max. nº of iterations and the learning rate α hyperparameters, within minimum and maximum values, which will make the model take more or less time to train
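
For these scenarios, a synthetic dataset could be generated along the following lines. This is only a sketch under assumptions: the sizes, the random seed, the uniform ranges and the way the error term is applied (a random relative perturbation of up to ±`e`) are choices made here for illustration, not requirements of the exercise.

```python
import numpy as np

rng = np.random.default_rng(42)    # fixed seed so the sketch is reproducible

m = 100     # nº of examples (assumed value)
n = 3       # nº of features, including the bias column (assumed value)
e = 0.1     # relative size of the error term (assumed value)

# Features in [-1, 1), with a first column of ones for the bias term
X = rng.uniform(-1., 1., size=(m, n))
X[:, 0] = 1.

# True coefficients that the training should recover
Theta_verd = rng.uniform(0., 10., size=n)

# Noise-free target, then perturbed by a random factor in [1 - e, 1 + e)
Y = np.matmul(X, Theta_verd)
Y = Y * (1. + rng.uniform(-e, e, size=m))

print('True Theta:', Theta_verd, Theta_verd.shape)
print('X and Y shapes:', X.shape, Y.shape)
```

Setting `e = 0.` in this sketch removes the error term, which is the case where the trained coefficients should match `Theta_verd` most closely.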

In [None]:
# TODO: Generate a synthetic dataset, with an error term, in the way you choose, with NumPy or Scikit-learn

m = 0
n = 0
e = 0.

X = [...]

Theta_verd = [...]

Y = [...]

# Check the values and dimensions (shape) of the vectors
print('True Theta to estimate:')
print()
print('shape')

print('First 10 rows and 5 columns of X and Y:')
print()
print()

print('Dimensions of X and Y:')
print('shape', 'shape')

In [None]:
# TODO: Check your implementation by training a model on the synthetic dataset created previously

# Use a randomly initialised theta or Theta_verd, depending on the scenario to check
theta_ini = [...]

print('Initial theta:')
print(theta_ini)

alpha = 1e-1
e = 1e-3
iter_ = 1e3    # Check that your function can accept float values, or modify it

print('Hyperparameters used:')
print('Alpha:', alpha, 'Max. error:', e, 'Nº iter.:', iter_)

t = time.time()
j_hist, theta_final = gradient_descent([...])

print('Training time (s):', time.time() - t)

# TODO: complete
print('\nLast 10 values of the cost function')
print(j_hist[...])
print('\nFinal cost:')
print(j_hist[...])
print('\nFinal theta:')
print(theta_final)

print('True values of Theta and difference from the trained values:')
print(Theta_verd)
print(theta_final - Theta_verd)

Plot the history of the cost function to check your implementation:

In [None]:
# TODO: Plot the cost function vs the nº of iterations

plt.figure()

plt.title('Cost function')
plt.xlabel('nº of iterations')
plt.ylabel('cost')

plt.plot([...])    # Complete

plt.grid()
plt.show()

To fully check the implementation of these functions, modify the original synthetic dataset and verify that the cost function and the gradient descent training still work correctly.

E.g., modify the nº of examples and the nº of features.

Also add an error term to Y again. In this case, the initial and final Theta may not quite match, since we have introduced error or "noise" into the training dataset.

Finally, check all the hyperparameters of your implementation. Use several values of alpha, e, nº of iterations, etc., and check that the results are as expected.
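
One optional way to judge whether the trained coefficients are "as expected" is to compare them with the least-squares solution that NumPy computes directly. This goes slightly beyond the exercise and is only a sketch: the dataset, `theta_true` and the hyperparameter values are made up, and the plain loop below is just a stand-in for the result of your own `gradient_descent` call.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up dataset with additive Gaussian noise (assumed values)
m, n = 200, 4
X = rng.uniform(-1., 1., size=(m, n))
X[:, 0] = 1.                                   # bias column
theta_true = np.array([3., -2., 0.5, 4.])
Y = np.matmul(X, theta_true) + rng.normal(0., 0.1, size=m)

# Stand-in for the output of gradient_descent(): a plain vectorised loop
theta_final = np.zeros(n)
alpha = 0.1
for _ in range(5000):
    gradient = np.matmul(X.T, np.matmul(X, theta_final) - Y) / m
    theta_final = theta_final - alpha * gradient

# Reference: least-squares solution of X @ theta = Y
theta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print('Gradient descent:', theta_final)
print('Least squares:   ', theta_lstsq)
print('Max. abs. difference:', np.max(np.abs(theta_final - theta_lstsq)))
```

If training has converged, both vectors should agree closely with each other, while both may differ slightly from `theta_true` because of the noise added to Y.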