# Linear Regression: Comparison of regularisation values
M2U3 - Exercise 3

## What are we going to do?
- Create a synthetic dataset for multivariate linear regression with a random error term
- Train 3 different linear regression models on this dataset with different *lambda* values
- Compare the effect of the *lambda* value on the model, its accuracy, and its residuals

Remember to follow the instructions for the submission of assignments indicated in [Submission Instructions](https://github.com/Tokio-School/Machine-Learning-EN/blob/main/Submission_instructions.md).

## Create a synthetic dataset with error term for training and final testing

We will start, as usual, by creating a synthetic dataset for linear regression, with bias and error terms, either manually or with Scikit-learn methods.

This time we are going to create 2 datasets, one for training and one for the final test, following the same pattern but with different sizes. We will train the models on the first dataset and then use the second dataset to check how they behave on completely new data that they have not "seen" during training.

In [None]:
import time

import numpy as np
from matplotlib import pyplot as plt

# TODO: Generate a synthetic dataset manually, with bias term and error term

m = 100
n = 1

X_train = [...]
X_test = [...]    # The size of the test dataset should be 25% of the original

Theta_true = [...]

error = 0.35

Y_train = [...]
Y_test = [...]

# Check the values and dimensions of the vectors
print('Theta to be estimated and its dimensions:')
print()
print()

# Check X_train, X_test, Y_train and Y_test
print('First 10 rows and 5 columns of X and Y:')
print()
print()
print()
print()

print('Dimensions of X and Y:')
print()
print()
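
As a point of comparison for the cell above, here is a minimal sketch of one way to build both datasets with NumPy. The additive uniform noise, the value ranges, the fixed seed, and the helper name `make_dataset` are all assumptions made for illustration; the exercise does not prescribe them.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the sketch is reproducible

m = 100          # number of training examples
n = 1            # number of features (without the bias column)
m_test = m // 4  # test dataset with 25% of the training size
error = 0.35     # half-width of the additive error term (assumption)


def make_dataset(m, n, theta, error, rng):
    """Return X with a leading column of ones and a noisy Y = X @ theta."""
    X = np.hstack([np.ones((m, 1)), rng.uniform(-1, 1, size=(m, n))])
    Y = X @ theta + rng.uniform(-error, error, size=m)
    return X, Y


# Both datasets share the same true parameters, so they follow the same pattern
Theta_true = rng.uniform(1, 10, size=n + 1)

X_train, Y_train = make_dataset(m, n, Theta_true, error, rng)
X_test, Y_test = make_dataset(m_test, n, Theta_true, error, rng)

print('Theta to be estimated and its dimensions:')
print(Theta_true, Theta_true.shape)
print('Dimensions of X and Y:')
print(X_train.shape, Y_train.shape)
print(X_test.shape, Y_test.shape)
```

Generating both sets from the same `Theta_true` and the same noise model is what makes the test dataset a fair check on new data.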

## Train 3 different models with different *lambda* values

We will train 3 different models on this dataset with different *lambda* values.

To do this, start by copying your cells with the code that implements the regularised cost function and gradient descent:

In [None]:
# TODO: Copy here the cells or code that implement the 2 functions:
# the regularised cost function and gradient descent
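
If you no longer have those cells to hand, the following is a hedged sketch of what the two functions might look like. The names `regularized_cost` and the argument order of `gradient_descent` are assumptions; only the return order `(j_hist, theta)` is taken from the training cell in this notebook. The sketch also assumes the bias term `theta[0]` is left out of the penalty.

```python
import numpy as np


def regularized_cost(X, y, theta, lambda_=0.0):
    """Regularised squared-error cost; the bias theta[0] is not penalised."""
    m = X.shape[0]
    residuals = X @ theta - y
    penalty = lambda_ * np.sum(theta[1:] ** 2)
    return (residuals @ residuals + penalty) / (2 * m)


def gradient_descent(X, y, theta, alpha, lambda_=0.0, e=1e-3, iter_=1e3):
    """Batch gradient descent; stops at iter_ steps or when the cost change < e."""
    m = X.shape[0]
    theta = theta.astype(float).copy()
    j_hist = [regularized_cost(X, y, theta, lambda_)]
    for _ in range(int(iter_)):  # int() lets callers pass floats such as 1e3
        grad = X.T @ (X @ theta - y) / m
        grad[1:] += (lambda_ / m) * theta[1:]
        theta -= alpha * grad
        j_hist.append(regularized_cost(X, y, theta, lambda_))
        if abs(j_hist[-2] - j_hist[-1]) < e:
            break
    return j_hist, theta


# Small check on noiseless data generated from theta = [2, 3]
x = np.linspace(-1, 1, 50)
X = np.column_stack([np.ones_like(x), x])
y = 2.0 + 3.0 * x
j_hist, theta = gradient_descent(X, y, np.zeros(2), alpha=0.1, e=1e-12, iter_=5e3)
print(theta)  # should approach the generating values [2, 3]
```

Converting `iter_` with `int()` is one way to handle the float value `1e3` used later in the notebook.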

Let's train the models. To do this, remember that with Jupyter you can simply modify the code cells and the variables will remain in the Jupyter kernel memory.

Therefore, you can, for example, modify the names of the following variables, changing "1" to "2" and "3", and simply re-execute the cell to store the results of the 3 models, while the variables of the previous models remain available.

If you run into any difficulties, you can also copy the code cell several times and have 3 cells to train 3 models with different variable names.

In [None]:
# TODO: Test your implementation by training a model on the previously created synthetic dataset

# Create an initial theta with a given constant value (not randomly this time).
theta_ini = [...]

print('Initial theta:')
print(theta_ini)

alpha = 1e-1
lambda_ = [1e-3, 1e-1, 1e1]    # We use 3 different values
e = 1e-3
iter_ = 1e3    # Check that your function can support float values or modify it

print('Hyperparameters used:')
print('Alpha:', alpha, 'Max error:', e, 'Max. iterations:', iter_)

t = time.time()
# Use lambda_[i], with i in [0, 1, 2], for each model
j_hist_1, theta_final_1 = gradient_descent([...])

print('Training time (s):', time.time() - t)

# TODO: complete
print('\nLast 10 values of the cost function')
print(j_hist_1[...])
print('\nFinal cost:')
print(j_hist_1[...])
print('\nFinal theta:')
print(theta_final_1)

print('True values of Theta and difference with trained values:')
print(Theta_true)
print(theta_final_1 - Theta_true)
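
As an alternative to renaming variables by hand, the three runs can be kept in a list, one entry per *lambda* value. The sketch below is only illustrative: `train` is a hypothetical, fixed-iteration stand-in for your own `gradient_descent`, and the data, `theta_ini` and true parameters are made up so that the block runs on its own.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in training data: bias column plus one feature, with additive noise
m = 100
x = rng.uniform(-1, 1, size=m)
X_train = np.column_stack([np.ones(m), x])
Y_train = 2.0 + 3.0 * x + rng.normal(0, 0.35, size=m)


def train(X, y, theta, alpha, lambda_, iter_):
    """Hypothetical stand-in for your own gradient_descent function."""
    m = X.shape[0]
    theta = theta.astype(float).copy()
    j_hist = []
    for _ in range(int(iter_)):
        grad = X.T @ (X @ theta - y) / m
        grad[1:] += (lambda_ / m) * theta[1:]  # the bias term is not regularised
        theta -= alpha * grad
        residuals = X @ theta - y
        penalty = lambda_ * np.sum(theta[1:] ** 2)
        j_hist.append((residuals @ residuals + penalty) / (2 * m))
    return j_hist, theta


alpha = 1e-1
lambdas = [1e-3, 1e-1, 1e1]
theta_ini = np.zeros(X_train.shape[1])

# One entry per model, so no result is overwritten by the next run
results = []
for lambda_ in lambdas:
    j_hist, theta_final = train(X_train, Y_train, theta_ini, alpha, lambda_, 1e3)
    results.append({'lambda': lambda_, 'j_hist': j_hist, 'theta': theta_final})

for r in results:
    print('lambda:', r['lambda'], 'final cost:', r['j_hist'][-1], 'theta:', r['theta'])
```

Keeping the runs in one structure also makes the later comparisons easier to write as loops.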

## Graphically check the effect of lambda on the models

Now let's check the 3 models against each other.

Let's start by checking the final cost, a representation of their accuracy:

In [None]:
# TODO: Show the final cost of the 3 models:

print('Final cost of the 3 models:')
print(j_hist_1[...])
print(j_hist_2[...])
print(j_hist_3[...])

# Visually represent the cost vs. lambda values with a line and dot plot
plt.plot([...])

*How does a higher *lambda* value affect the final cost in this dataset?*

Let's plot the training and test datasets, to check that they follow a similar pattern:

In [None]:
# TODO: Plot X_train vs Y_train, and X_test vs Y_test graphically.

plt.figure(1)

plt.title([...])
plt.xlabel([...])
plt.ylabel([...])

# Remember to use different colours
plt.scatter([...])
plt.scatter([...])

# Create a legend for the different series and their colours

plt.show()

We will now check the predictions of each model on the training dataset, to see how well the line fits the training values in each case:

In [None]:
# TODO: Calculate the predictions for each model on X_train

Y_train_pred1 = [...]
Y_train_pred2 = [...]
Y_train_pred3 = [...]
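
For reference, the prediction of a linear regression model is the product of the feature matrix (bias column included) and the parameter vector. A minimal sketch follows; the helper name `predict` and the demo values are assumptions used only for illustration.

```python
import numpy as np


def predict(X, theta):
    """Linear regression hypothesis: one prediction per row of X."""
    return X @ theta


X_demo = np.array([[1.0, 0.0], [1.0, 0.5], [1.0, 1.0]])  # bias column first
theta_demo = np.array([2.0, 4.0])
print(predict(X_demo, theta_demo))  # [2. 4. 6.]
```

The same call works for the training and the test dataset, as long as both include the bias column.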

In [None]:
# TODO: For each model, graphically represent its predictions on X_train

# If you get an error with other notebook charts, use the plt.figure() line below or comment it out
plt.figure(2)

fig, (ax1, ax2, ax3) = plt.subplots(3, sharex=True, sharey=True)
fig.suptitle([...])

# Use different colours for each model

ax1.plot()
ax1.scatter()

ax2.plot()
ax2.scatter()

ax3.plot()
ax3.scatter()

Since the training dataset has an error term, there may be significant differences between the data in the training dataset and the test dataset. You can play with various values of this term to increase or decrease the difference.

Let's check what happens to the predictions when we plot them on the test dataset, on data that the models have not seen before:

In [None]:
# TODO: Calculate predictions for each model on X_test

Y_test_pred1 = [...]
Y_test_pred2 = [...]
Y_test_pred3 = [...]

In [None]:
# TODO: For each model, graphically represent its predictions about X_test

# If you get an error with other notebook charts, use the plt.figure() line below or comment it out
plt.figure(3)

fig, (ax1, ax2, ax3) = plt.subplots(3, sharex=True, sharey=True)
fig.suptitle([...])

# Use different colours for each model

ax1.plot()
ax1.scatter()

ax2.plot()
ax2.scatter()

ax3.plot()
ax3.scatter()

What happens? Depending on the parameters used, the effect may be easier or harder to discern.

When the model has a low or zero *lambda* regularisation factor, it fits too closely to the data on which it is trained, achieving a very tight curve and maximum accuracy... only on that particular dataset.

However, in real life, the model may later receive data on which it was not trained and which varies slightly from the original data.

In such situations we prefer a higher *lambda* value, which can give us higher accuracy on the new data, even if we lose some accuracy on the training dataset.

We are therefore looking for a model that can "generalise" and make good predictions about new data, rather than one that simply "memorises" the results it has already seen.

We can therefore think of a model trained without enough regularisation as a student who has seen the exam questions before sitting the exam:
- If he then gets those same questions, he will get a very high mark (or accuracy), as he has already "seen" them beforehand.
- If the questions are somewhat different, he may still get a high mark, depending on how similar they are.
- However, if the questions are totally different, he will get a very low mark, because he had not thoroughly studied the subject: his marks were high only because he knew the answers beforehand.
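
The trade-off described above can be checked numerically by comparing the error on the training and test datasets for each *lambda*. The sketch below is an illustration under stated assumptions: it uses the closed-form regularised normal equation instead of gradient descent, to keep the block short, and the data, noise level, and helper names `make_data`, `fit_ridge` and `mse` are made up.

```python
import numpy as np

rng = np.random.default_rng(0)


def make_data(m, rng):
    """Stand-in dataset: bias column, one feature, additive Gaussian noise."""
    x = rng.uniform(-1, 1, size=m)
    X = np.column_stack([np.ones(m), x])
    y = 2.0 + 3.0 * x + rng.normal(0, 0.35, size=m)
    return X, y


def fit_ridge(X, y, lambda_):
    """Regularised normal equation; the bias term is left unpenalised."""
    penalty = lambda_ * np.eye(X.shape[1])
    penalty[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)


def mse(X, y, theta):
    """Mean squared error of the predictions X @ theta."""
    return np.mean((X @ theta - y) ** 2)


X_train, y_train = make_data(100, rng)
X_test, y_test = make_data(25, rng)

lambdas = [1e-3, 1e-1, 1e1]
train_mse, test_mse = [], []
for lambda_ in lambdas:
    theta = fit_ridge(X_train, y_train, lambda_)
    train_mse.append(mse(X_train, y_train, theta))
    test_mse.append(mse(X_test, y_test, theta))
    print('lambda:', lambda_, 'train MSE:', train_mse[-1], 'test MSE:', test_mse[-1])
```

The training error cannot decrease as *lambda* grows, since the fit is pulled away from the least-squares optimum. Whether the test error improves depends on the noise level, the dataset sizes, and the complexity of the model.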

## Check the residuals on the final test subset

Plot the residuals for the 3 models graphically. That way you will be able to compare your 3 models on the 2 datasets.

Calculate the residuals for both the training and testing datasets. You can do this with different cells to be able to appreciate their differences simultaneously.

*Tips*:
- Be careful with the scales of the X and Y axes when making comparisons.
- To be able to see them at the same time, you can create 3 horizontal subplots, instead of vertical ones, with "plt.subplots(1, 3)".
- Use different colours for each of the 3 models.

If you do not clearly see such effects on your datasets, you can try modifying the initial values:
- With a larger number of examples, so that the models can be more accurate.
- With a larger error term, so that there is more difference or variation between examples.
- With a smaller test dataset relative to the training dataset, so that there are more differences between the two (with more data, the values are smoothed out more).
- Etc.
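
The residual plots suggested in the tips could be laid out as in the following sketch. The data and the three parameter vectors in `thetas` are hypothetical stand-ins for your test dataset and trained models, included only so that the block runs on its own; residuals are taken here as prediction minus observed value.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)

# Stand-in data and parameters, only to make the sketch runnable
x = rng.uniform(-1, 1, size=25)
X_test = np.column_stack([np.ones_like(x), x])
Y_test = 2.0 + 3.0 * x + rng.normal(0, 0.35, size=x.size)
thetas = [np.array([2.0, 3.0]), np.array([2.0, 2.5]), np.array([2.0, 1.0])]
colours = ['tab:blue', 'tab:orange', 'tab:green']

# Residual = prediction minus observed value, one array per model
residuals = [X_test @ theta - Y_test for theta in thetas]

# sharey=True keeps the same vertical scale so the panels are comparable
fig, axes = plt.subplots(1, 3, sharey=True, figsize=(12, 4))
fig.suptitle('Residuals on the test dataset')
for i, (ax, res, colour) in enumerate(zip(axes, residuals, colours), start=1):
    ax.scatter(x, res, color=colour)
    ax.axhline(0, color='grey', linewidth=1)  # zero-residual reference line
    ax.set_title(f'Model {i}')
    ax.set_xlabel('x')
axes[0].set_ylabel('Residual')
plt.show()
```

Repeating the same cell with the training dataset gives a second figure to set beside this one.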