# Linear Regression: Synthetic dataset example
M2U2 - Exercise 5

## What are we going to do?
- Use an automatically generated synthetic dataset to check our implementation
- Train a multivariate linear regression ML model
- Check the training evolution of the model
- Evaluate the model in a simple way
- Make predictions for new examples

Remember to follow the instructions for the submission of assignments indicated in [Submission Instructions](https://github.com/Tokio-School/Machine-Learning-EN/blob/main/Submission_instructions.md).

In [None]:
import time
import numpy as np
from matplotlib import pyplot as plt

## Creation of a synthetic dataset

We are going to create a synthetic dataset to check our implementation.

Following the methods that we have used in previous exercises, create a synthetic dataset using NumPy.

Include a controllable error term in the dataset, but initialise it to 0: for the first implementation of this multivariate linear regression model, we do not want noise in the data that could mask an error in our model.

Afterwards, we will introduce an error term to check that our implementation can also train the model under these more realistic circumstances.

### The bias or intercept term

This time, we are going to generate the synthetic dataset with a small modification: we are going to add a first column of 1s to X, i.e. a 1. (float) as the first feature value of each example.

Since this adds one more column to the matrix X, we also add one more value to the vector $\Theta$: X now has n + 1 columns and $\Theta$ has n + 1 values.

Why do we add this column, this new term or feature?

Because this is the simplest way to implement the linear equation in a single linear algebra operation, i.e., to vectorise it.

In this way, we convert $Y = m \times X + b$ into $Y = X \times \Theta$, saving an addition and implementing the equation as a single matrix multiplication.

The term *b* is therefore incorporated as the first element of the vector $\Theta$. When it is multiplied by the first column of X, which has a value of 1 in every row, *b* is added to each example.
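
The equivalence described above can be checked on a tiny example. This is only an illustrative sketch: the names `m_slope`, `b`, `x_bias` and the numeric values are assumptions chosen for the demonstration, not part of the exercise.

```python
import numpy as np

x = np.array([[0.5], [-1.0], [2.0]])   # 3 examples, 1 feature
m_slope, b = 3.0, 1.5                  # hypothetical slope and intercept

# Explicit form: a multiplication followed by an addition
y_explicit = m_slope * x[:, 0] + b

# Vectorised form: prepend a column of 1s and fold b into theta
x_bias = np.insert(x, 0, 1.0, axis=1)  # shape (3, 2)
theta = np.array([b, m_slope])         # bias first, then the slope
y_vectorised = np.matmul(x_bias, theta)

print(y_explicit)     # [ 3.  -1.5  7.5]
print(y_vectorised)   # same values, from a single matrix multiplication
```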

In [None]:
# TODO: Generate a synthetic dataset in whatever way you choose, with error term initially set to 0

m = 100
n = 3

# Create a matrix of random numbers in the interval [-1, 1)
X = [...]
# Insert a vector of 1s as the 1st column of X
# Tips: np.insert(), np.ones(), index 0, axis 1...
X = [...]

# Generate a vector of random numbers in the interval [0, 1) of size n + 1 (to add the bias term)
Theta_true = [...]

# Add a random error term to the Y vector, as a fraction (0.1 = 10%), initialised at 0
# The term represents an error of up to +/- that percentage (e.g. +/- 5%, +/- 10%), so it can subtract as well as add
# The percentage is relative to Y, so the error would be e.g. +3.14% of Y, or -4.12% of Y
error = 0.

Y = np.matmul(X, Theta_true)
Y = Y + [...] * error

# Check the values and dimensions of the vectors
print('Theta to be estimated and its dimensions:')
print()
print()

print('First 10 rows and 5 columns of X and Y:')
print()
print()

print('Dimensions of X and Y:')
print()

Note the matrix multiplication operation implemented: $Y = X \times \Theta$

Check the dimensions of each vector: X, Y, $\Theta$.
*Do you think this operation is possible according to the rules of linear algebra?*

If you have doubts, you can consult the NumPy documentation relating to the np.matmul function.

Check the result, perhaps reducing the original number of examples and features, and make sure it is correct.
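
As a hedged illustration of that check, the sketch below uses a deliberately small dataset (4 examples, 2 features plus the bias column) so that the result can be verified by hand. The names `X_small`, `theta_small` and `Y_small` are hypothetical and the values are arbitrary.

```python
import numpy as np

m_small, n_small = 4, 2

# Features 0..7 arranged as 4 examples x 2 features, plus a leading column of 1s
X_small = np.arange(m_small * n_small, dtype=float).reshape(m_small, n_small)
X_small = np.insert(X_small, 0, 1.0, axis=1)   # shape (4, 3)
theta_small = np.array([1.0, 2.0, 3.0])        # shape (3,)

# (m, n + 1) times (n + 1,) gives one value per example: shape (m,)
Y_small = np.matmul(X_small, theta_small)

# The same result computed element by element, for comparison
Y_manual = np.array([
    sum(X_small[i, j] * theta_small[j] for j in range(n_small + 1))
    for i in range(m_small)
])

print(X_small.shape, theta_small.shape, Y_small.shape)   # (4, 3) (3,) (4,)
print(Y_small)                                           # [ 4. 14. 24. 34.]
```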

## Training the model

Copy your implementation of the cost function and its optimisation by gradient descent from the previous exercise:

In [None]:
# TODO: Copy the code of your cost and gradient descent functions

def cost_function(x, y, theta):
    """ Computes the cost function for the considered dataset and coefficients.
    
    Positional arguments:
    x -- NumPy 2D array with the values of the independent variables from the examples, of size m x (n + 1)
    y -- NumPy 1D array with the dependent/target variable, of size m
    theta -- NumPy 1D array with the weights of the model coefficients, of size n + 1
    
    Return:
    j -- float with the cost for this theta array
    """
    pass

def gradient_descent(x, y, theta, alpha, e, iter_):
    """ Train the model by optimising its cost function by gradient descent
    
    Positional arguments:
    x -- NumPy 2D array with the values of the independent variables from the examples, of size m x (n + 1)
    y -- NumPy 1D array with the dependent/target variable, of size m
    theta -- NumPy 1D array with the weights of the model coefficients, of size n + 1
    alpha -- float, learning rate
    
    Named arguments (keyword):
    e -- float, minimum difference between consecutive iterations to declare that the training has converged
    iter_ -- int/float, maximum number of iterations
    
    Return:
    j_hist -- list/array with the evolution of the cost function during training, with one value per iteration actually run
    theta -- NumPy array with the value of theta at the last iteration, of size n + 1
    """
    pass

We will use these functions to train our ML model.

As a reminder, these are the steps we will follow:
- Initialise $\Theta$ with random values
- Optimise $\Theta$ iteratively, reducing the cost at each iteration
- Once we have found the minimum of the cost function, take its associated $\Theta$ as the coefficients of our model
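
The steps above can be sketched as follows. This is one possible implementation, not necessarily the one from your previous exercise: it assumes the half mean squared error cost $J = \frac{1}{2m}\sum (X\Theta - Y)^2$ and a stopping rule based on the change in cost, and the names `cost_sketch`, `gradient_descent_sketch`, `x_demo` and the default values are hypothetical.

```python
import numpy as np

def cost_sketch(x, y, theta):
    """Half mean squared error (assumed cost definition)."""
    m = x.shape[0]
    residuals = np.matmul(x, theta) - y
    return float(np.sum(residuals ** 2) / (2 * m))

def gradient_descent_sketch(x, y, theta, alpha, e=1e-4, iter_=1000):
    """Batch gradient descent; stops early when the cost changes by less than e."""
    m = x.shape[0]
    theta = theta.copy()                      # never modify the caller's array
    j_hist = [cost_sketch(x, y, theta)]
    for _ in range(int(iter_)):
        gradient = np.matmul(x.T, np.matmul(x, theta) - y) / m
        theta = theta - alpha * gradient      # creates a new array each iteration
        j_hist.append(cost_sketch(x, y, theta))
        if abs(j_hist[-2] - j_hist[-1]) < e:
            break
    return j_hist, theta

# Small noise-free dataset with a bias column, as in this exercise
rng = np.random.default_rng(0)
x_demo = np.insert(rng.uniform(-1, 1, size=(50, 2)), 0, 1.0, axis=1)
theta_demo_true = np.array([0.5, 0.2, 0.8])
y_demo = np.matmul(x_demo, theta_demo_true)

theta_demo_ini = rng.random(3)
j_hist_demo, theta_demo = gradient_descent_sketch(
    x_demo, y_demo, theta_demo_ini, alpha=0.1, e=1e-12, iter_=1e4)
print(theta_demo, j_hist_demo[-1])
```

With no error term in the data, the recovered coefficients should end up very close to `theta_demo_true`.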

To do this, fill in the code in the following cell:

In [None]:
# TODO: Train your ML model by optimising its Theta coefficients using gradient descent

# Initialise theta with n + 1 random values
theta_ini = [...]

print('Initial theta:')
print(theta_ini)

alpha = 1e-1
e = 1e-4
iter_ = 1e5

print('Hyperparameters to be used:')
print('Alpha: {}, e: {}, max nº. iter: {}'.format(alpha, e, iter_))

t = time.time()
j_hist, theta = gradient_descent([...])

print('Training time (s):', time.time() - t)

# TODO: complete
print('\nLast 10 values of the cost function')
print(j_hist[...])
print('\nFinal cost:')
print(j_hist[...])
print('\nFinal theta:')
print(theta)

print('True values of Theta and difference with trained values:')
print(Theta_true)
print(theta - Theta_true)

Check that the initial $\Theta$ has not been modified. Your implementation must create a new Python object (a copy) at each iteration instead of modifying the original array during training.
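
The difference between updating an array in place and updating a copy can be seen in this minimal sketch. The function names `update_in_place` and `update_on_copy` are hypothetical and exist only to illustrate the pitfall.

```python
import numpy as np

def update_in_place(theta, step):
    theta -= step              # modifies the array passed by the caller
    return theta

def update_on_copy(theta, step):
    new_theta = theta.copy()   # new Python object, independent of the input
    new_theta -= step
    return new_theta

theta_a = np.array([1.0, 2.0, 3.0])
result_a = update_in_place(theta_a, 0.5)
print(theta_a)                 # [0.5 1.5 2.5] -> the "initial" values are lost

theta_b = np.array([1.0, 2.0, 3.0])
result_b = update_on_copy(theta_b, 0.5)
print(theta_b)                 # [1. 2. 3.] -> the initial values are preserved
```

Note that `theta = theta - step` also creates a new array, whereas `theta -= step` does not.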

In [None]:
# TODO: Check that the initial Theta has not been modified

print('Initial theta and final theta:')
print(theta_ini)
print(theta)

### Check the training of the model

To check the training of the model, we will plot the evolution of the cost function, to make sure that there have been no large jumps and that it has moved steadily towards a minimum value:
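
As a complement to the plot, the same property can be checked numerically. The helper below is a hypothetical sketch (the name `cost_history_report` and its output keys are assumptions): it reports whether the cost ever increased between consecutive iterations.

```python
import numpy as np

def cost_history_report(j_hist):
    """Summarise a cost history: is it non-increasing, and by how much did it drop?"""
    j = np.asarray(j_hist, dtype=float)
    steps = np.diff(j)                       # cost change between consecutive iterations
    return {
        'monotonic': bool(np.all(steps <= 0)),
        'largest_step': float(steps.max()) if steps.size else 0.0,
        'total_reduction': float(j[0] - j[-1]),
    }

good = cost_history_report([4.0, 2.0, 1.0, 0.5])   # steadily decreasing
bad = cost_history_report([4.0, 2.0, 3.0, 0.5])    # jumps up once
print(good)
print(bad)
```

A positive `largest_step` means that the cost went up at some point, which often indicates that the learning rate is too high.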

In [None]:
# TODO: Plot the evolution of the cost function vs. the number of iterations

plt.figure(1)

plt.title('Cost function')
plt.xlabel('Iterations')
plt.ylabel('Cost')

plt.plot([...])    # Fill in the arguments

plt.show()

## Making predictions

We are going to use $\Theta$, the result of our model training process, to make predictions on new examples that will arrive in the future.

We will generate a new dataset X following the same steps as before. Therefore, if X has the same number of features (n + 1) and its values are in the same range as the X generated previously, the new examples will behave in the same way as the data used to train the model.
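
A minimal sketch of this prediction step is shown below. It assumes features drawn uniformly from [-1, 1) as in the dataset cell, and `theta_trained` is a hypothetical stand-in for the coefficients returned by your own training.

```python
import numpy as np

rng = np.random.default_rng(1)   # seed chosen only for reproducibility
n_features = 3
m_new = 25                       # e.g. 25% of the original 100 examples

# Hypothetical trained coefficients: bias first, then one weight per feature
theta_trained = np.array([0.4, 0.1, 0.7, 0.3])

# New examples in the same range as the training data, plus the column of 1s
X_new = rng.uniform(-1.0, 1.0, size=(m_new, n_features))
X_new = np.insert(X_new, 0, 1.0, axis=1)      # shape (25, 4)

# One prediction per new example
y_new = np.matmul(X_new, theta_trained)       # shape (25,)
print(y_new[:5])
```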

In [None]:
# TODO: Make predictions using the computed theta

# Generate a new X matrix with new examples. Use the same number of features and the same range of
# random values, but fewer examples (e.g. 25% of the original)
# Remember to add the bias term, i.e. a first column of 1s, so that the matrix has size m x (n + 1)
X_pred = [...]

# Compute the predictions for the new data
y_pred = [...]    # Hint: matmul again

print('Predictions:')
print(y_pred)    # You can print the whole vector or only the first few values if it is too long

## Model evaluation

We have several options for evaluating the model. At this point, we are going to carry out a simpler, quicker and more informal evaluation. In later modules of the course we will see how to evaluate our models in a more formal and precise way.

We are going to do a graphical evaluation, simply to check that our implementation works as expected:

In [None]:
# TODO: Plot the residuals between the initial Y and the predicted Y for the same examples

# Make predictions for each value of the original X with the theta trained by the model
Y_pred = [...]

plt.figure(2)

plt.title('Original dataset and predictions')
plt.xlabel('X')
plt.ylabel('Residuals')

# Compute the residuals for each example
# Remember that they are the absolute difference between the actual Y and the predicted Y for each example
residuos = [...]

# Use a plot with different series: training Y, predicted Y and residuals
# Use a scatter plot for the training Y, a line plot for the predicted Y and a bar plot for the residuals, overlaid
plt.scatter([...])

plt.show()

If our implementation is correct, our model should have trained successfully and have practically zero residuals, i.e. practically no difference between the original results (Y) and the results that our model would compute.

However, remember that in the first section we created a dataset with the error term set to 0. Therefore, the values of Y have no random difference or variation with respect to their true values.

In real life, whether because we have not taken into account all the features that affect our target variable, because the data contain some small error, or because data generally do not follow a perfectly precise behaviour, we will always have some error term, more or less random.

Therefore, *what if you go back to the cell where you generated the dataset, modify your error term, and run the following cells again to train and evaluate a new linear regression model on more realistic data?*

Use this to check the robustness of your implementation.
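
To get an idea of what to expect, the sketch below compares the residuals with and without an error term. Note that it fits the coefficients with `np.linalg.lstsq` as a stand-in for your own gradient descent, and that the way the percentage error is applied (a uniform factor in [-1, 1) scaled by `error`) is an assumption. All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(7)   # seed chosen only for reproducibility

# Synthetic dataset with a bias column, as in this exercise
X_chk = np.insert(rng.uniform(-1, 1, size=(100, 3)), 0, 1.0, axis=1)
theta_chk_true = rng.random(4)
Y_clean = np.matmul(X_chk, theta_chk_true)

def mean_abs_residual(y, error):
    """Perturb y by up to +/- error (fraction of y), fit, and return the mean absolute residual."""
    y_obs = y + y * rng.uniform(-1, 1, size=y.shape) * error
    # Least-squares fit used here instead of gradient descent (stand-in technique)
    theta_fit, *_ = np.linalg.lstsq(X_chk, y_obs, rcond=None)
    return float(np.mean(np.abs(y_obs - np.matmul(X_chk, theta_fit))))

res_no_error = mean_abs_residual(Y_clean, 0.0)
res_with_error = mean_abs_residual(Y_clean, 0.1)
print(res_no_error, res_with_error)
```

With the error at 0 the residuals should be essentially zero. With a non-zero error they remain small but no longer vanish, which is the behaviour to look for in your own model.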