# Linear Regression: Normalisation
M2U3 - Exercise 1

## What are we going to do?
- We will create a synthetic dataset with features in different value ranges
- We will train a linear regression model on the original dataset
- We will normalise the original dataset
- We will train another linear regression model on the normalised dataset
- We will compare the training of both models, normalised and non-normalised

Remember to follow the instructions for the submission of assignments indicated in [Submission Instructions](https://github.com/Tokio-School/Machine-Learning-EN/blob/main/Submission_instructions.md).

In [None]:
import time
import numpy as np
from matplotlib import pyplot as plt

## Creation of a synthetic dataset

We are going to manually create a synthetic dataset for linear regression.

Create the dataset manually, reusing the code from previous exercises rather than the specific Scikit-learn methods. Use an *X* approximately in the range (-1, 1) and an error term of 10% of the value of *Y*:
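
As a rough guide, the dataset described above could be generated along the following lines. This is only a sketch under stated assumptions, not the code from the previous exercises: the fixed seed, the uniform distributions, the range of `Theta_true`, the helper name `Y_clean`, and the reading of the error term as multiplicative noise of up to ±10% are all choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)  # assumption: fixed seed, only for reproducibility

m = 1000  # number of examples
n = 4     # number of features, not counting the bias column

# Assumption: X has n + 1 columns, a leading column of 1's (bias)
# followed by n features drawn uniformly from (-1, 1)
X = np.hstack([np.ones((m, 1)), rng.uniform(-1, 1, size=(m, n))])

# Assumption: true parameters drawn uniformly from (0, 10), one per column of X
Theta_true = rng.uniform(0, 10, size=n + 1)

error = 0.1

# Noise-free target first, then a multiplicative error of up to +/-10% of Y
Y_clean = X @ Theta_true
Y = Y_clean * (1 + rng.uniform(-error, error, size=m))

print(X.shape, Theta_true.shape, Y.shape)
```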

In [None]:
# TODO: Copy code from previous exercises to generate a dataset with a bias term and an error term

m = 1000
n = 4

X = [...]

Theta_true = [...]

error = 0.1

Y = [...]

In [None]:
# Check the values and dimensions of the vectors
print('Theta to be estimated and its dimensions:')
print()
print()

print('First 10 rows and 5 columns of X and Y:')
print()
print()

print('Dimensions of X and Y:')
print()

We will now modify the dataset to ensure that each feature, each column of *X*, has a different order of magnitude and mean.

To do this, multiply each column of *X* (except the first one, the bias, which must be all 1’s) by a different range and add a different bias value to it.

The value we add will be the mean of that feature or column, and the value by which we multiply will be its range or scale.

E.g., $X_1 = X_1 * 10^3 + 3.1415926$, where `10^3` would be the scale and `3.1415926` the mean of the feature.
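
The transformation can be applied to all feature columns at once with NumPy broadcasting. The sketch below uses a small stand-in dataset; the values in `ranges` and `averages`, and the name `X_scaled`, are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 3

# Small stand-in dataset: bias column of 1's plus n features in (-1, 1)
X = np.hstack([np.ones((m, 1)), rng.uniform(-1, 1, size=(m, n))])

ranges = np.array([1e0, 1e3, 1e-2])            # hypothetical scale per feature
averages = np.array([0.5, 3.1415926, -20.0])   # hypothetical offset per feature

X_scaled = X.copy()
# Broadcasting applies ranges[j] and averages[j] to column j + 1;
# column 0 (the bias) is left untouched
X_scaled[:, 1:] = X[:, 1:] * ranges + averages

print(X_scaled)
```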

In [None]:
# TODO: For each column of X, multiply it by a range of values and add a different mean to it

# The arrays of ranges and averages must be of length n
# Create an array with the ranges of values, e.g.: 1e0, 1e3, 1e-2, 1e5
ranges = [...]

averages = [...]

X = [...]

print('X with different averages and scales')
print(X)
print(X.shape)

Remember that you can run Jupyter cells in a different order from their position in the document. The brackets to the left of the cells show the order of execution, and variables always keep the values from the last executed cell, **so be careful!**

## Training and evaluation of the model

Once again, we will train a multivariate linear regression model. This time, we are going to train it first on the original, non-normalised dataset, and then retrain it on the normalised dataset, in order to compare both models and training processes and see the effects of normalisation.

To do this, copy the cells or code from previous exercises and train a multivariate linear regression model, optimised by gradient descent, on the original dataset.

You must also copy the cells that check the training of the model by plotting the cost function vs. the number of iterations.

You do not need to make predictions about this data or evaluate the model’s residuals. In order to compare them, we will do so only on the basis of the final cost.
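
For reference, a minimal batch gradient descent loop for linear regression might look like the sketch below. The function names `cost` and `gradient_descent`, their signatures, the default hyperparameters and the small demo dataset are assumptions for illustration, and will likely differ from the code in your previous exercises.

```python
import numpy as np

def cost(X, y, theta):
    """Squared-error cost J(theta) = 1 / (2m) * sum((X @ theta - y)^2)."""
    residuals = X @ theta - y
    return residuals @ residuals / (2 * len(y))

def gradient_descent(X, y, theta, alpha=0.1, tol=1e-9, max_iter=10_000):
    """Batch gradient descent; stops when the cost changes by less than tol."""
    j_hist = [cost(X, y, theta)]
    for _ in range(max_iter):
        grad = X.T @ (X @ theta - y) / len(y)
        theta = theta - alpha * grad
        j_hist.append(cost(X, y, theta))
        if abs(j_hist[-2] - j_hist[-1]) < tol:
            break
    return theta, j_hist

# Tiny noise-free demo dataset with features in (-1, 1)
rng = np.random.default_rng(1)
X_demo = np.hstack([np.ones((200, 1)), rng.uniform(-1, 1, size=(200, 2))])
theta_demo_true = np.array([2.0, -3.0, 0.5])
y_demo = X_demo @ theta_demo_true

theta_fit, j_hist = gradient_descent(X_demo, y_demo, np.zeros(3))
print(theta_fit, len(j_hist) - 1, j_hist[-1])
```

Plotting `j_hist` against its index with `plt.plot(j_hist)` gives the cost vs. iterations curve. On features with very different scales, the same loop usually needs a much smaller `alpha` to avoid diverging, which is the effect this exercise sets out to show.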

In [None]:
# TODO: Train a linear regression model and plot the evolution of its cost function
# Use the non-normalised X
# Add the suffix "_no_norm" to the Theta and j_hist variables returned by your model

## Data normalisation

We are going to normalise the data from the original dataset.

To do this, we are going to create a normalisation function that applies the necessary transformation, according to the formula:

$x_j = \frac{x_j - \mu_{j}}{\sigma_{j}}$
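
As a small worked example of the formula, the sketch below normalises a toy array whose values are invented for illustration. After the transformation, every column has mean 0 and standard deviation 1.

```python
import numpy as np

# Two features on very different scales (values invented for illustration)
x = np.array([[1000.0, 0.01],
              [2000.0, 0.03],
              [3000.0, 0.05]])

mu = x.mean(axis=0)   # per-column mean
std = x.std(axis=0)   # per-column (population) standard deviation

# Broadcasting applies mu[j] and std[j] to column j
x_norm = (x - mu) / std

print(x_norm)
print(x_norm.mean(axis=0), x_norm.std(axis=0))
```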

In [None]:
# TODO: Implement a normalisation function to a common range and with a mean of 0

def normalize(x, mu, std):
    """ Normalise a dataset with X examples
    
    Positional arguments:
    x -- Numpy 2D array with the examples, no bias term
    mu -- Numpy 1D vector with the mean of each feature/column
    std -- Numpy 1D vector with the standard deviation of each feature/column
    
    Return:
    x_norm -- Numpy 2D array with the examples, and their normalised features
    """
    return [...]

In [None]:
# TODO: Normalise the original dataset using your normalisation function

# Find the mean and standard deviation of the features of X (columns), except the first column (bias).
mu = [...]
std = [...]

print('original X:')
print(X)
print(X.shape)

print('Mean and standard deviation of the features:')
print(mu)
print(mu.shape)
print(std)
print(std.shape)

print('normalised X:')
X_norm = np.copy(X)
X_norm[...] = normalize(X[...], mu, std)    # Normalise only column 1 and the subsequent columns, not column 0
print(X_norm)
print(X_norm.shape)

*BONUS:*
1. Calculate the means and standard deviations of *X_norm* according to its features/columns.
1. Compare them with those of *X*, *mu*, and *std*
1. Plot the distributions of *X* and *X_norm* in a bar graph or box plot (you can use multiple Matplotlib subplots).

## Retraining the model and comparison of results

Now retrain the model on the normalised dataset. Check the final cost and the iteration at which it converged.

To do this, you can go back to the model's training cells, replace the *X* they use with *X_norm*, and check the evolution of the cost function again.

In many cases, because it is such a simple model, there may be no noticeable improvement. Depending on the capacity of your working environment, try using a higher number of features and slightly increasing the error term of the dataset.
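
One way to see why normalisation can matter, independently of any particular training run, is to look at the condition number of the matrix $X^T X / m$, the Hessian of the squared-error cost. The larger the ratio between its largest and smallest eigenvalues, the smaller the learning rate gradient descent needs and the slower it converges. The sketch below is illustrative only: the dataset, scales and offsets, and the helper names `with_bias` and `condition_number`, are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
m = 500

# Hypothetical features with very different scales and means
features_raw = (rng.uniform(-1, 1, size=(m, 3)) * np.array([1e0, 1e3, 1e-2])
                + np.array([0.0, 50.0, 1.0]))
features_norm = (features_raw - features_raw.mean(axis=0)) / features_raw.std(axis=0)

def with_bias(features):
    """Prepend a column of 1's for the bias term."""
    return np.hstack([np.ones((len(features), 1)), features])

def condition_number(X):
    """Largest / smallest eigenvalue of X^T X / m (Hessian of the cost)."""
    eigvals = np.linalg.eigvalsh(X.T @ X / len(X))  # ascending order
    return eigvals[-1] / eigvals[0]

cond_raw = condition_number(with_bias(features_raw))
cond_norm = condition_number(with_bias(features_norm))

print(f'Condition number, raw features:        {cond_raw:.3g}')
print(f'Condition number, normalised features: {cond_norm:.3g}')
```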

In [None]:
# TODO: Train a linear regression model and plot the evolution of its cost function
# Use the normalised X
# Add the suffix "_norm" to the Theta and j_hist variables returned by your model

*QUESTION: Is there any difference in the accuracy and training time of the model on non-normalised data and the model on normalised data? If you increase the error term and the difference in means and ranges between the features, does it make more of a difference?*

## Beware of the original Theta

For the original dataset, before normalisation, the relationship $Y = X \times \Theta$ held.

However, we have now modified the *X* term of this function.

Therefore, check what happens if you recompute *Y* using the normalised *X*:

In [None]:
# TODO: Check for differences between the original Y and the Y computed using the normalized X

# Check the value of Y by multiplying X_norm and Theta_true
Y_norm = [...]

# Check for differences between Y_norm and Y
diff = Y_norm - Y

print('Difference between Y_norm and Y (first 10 rows):')
print(diff[:10])

# Plot the difference between the Ys vs X on a dot plot
[...]

### Make predictions

Similarly, what happens when we use the model to make predictions?

Generate a new dataset *X_pred* following the same method you used for the original *X* dataset, incorporating the bias term, multiplying its features by a range and adding different values to them, without finally normalising the dataset.

Also calculate its *Y_pred_true* (without error term), as the true value of Y to try to predict:

In [None]:
# TODO: Generate a new dataset with fewer examples and the same number of features as the original dataset
# Make sure its features/columns keep their different means and ranges, i.e. it is not normalised yet

X_pred = [...]

Y_pred_true = np.matmul(X_pred, Theta_true)

Now check if there is any difference between the *Y_pred_true* and the *Y_pred* that your model predicts:

In [None]:
# TODO: Check the differences between the actual Y and the predicted Y

Y_pred = np.matmul(X_pred, theta)    # theta: the parameters of the model trained on the normalised dataset

diff = Y_pred_true - Y_pred

print('Differences between actual Y and predicted Y:')
print(diff[:10])

Since the model was trained on normalised data, its predictions are only correct for normalised inputs, so we must first normalise the new *X_pred* before generating the predictions:

In [None]:
# TODO: Normalise the X_pred

X_pred[...] = normalize(X_pred[...], mu, std)

print(X_pred[:10,:])
print(X_pred.shape)

This time we have not created a new variable for the normalised data; we have overwritten *X_pred* in place.

Now that *X_pred* is normalised, you can rerun the previous cell to check whether there is still any difference between the actual *Y* and the predicted *Y*.

So always remember:
- The *theta* calculated when training the model will always be relative to the normalised dataset, and cannot be used for the original dataset, since with the same *Y* and a different *X*, Theta must change.
- To make predictions on new examples, we first have to normalise them as well, using the same values for the means and standard deviations that we originally used to train the model.
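
Both points can be checked numerically. The sketch below uses a small noise-free dataset and, as a stand-in for gradient descent, an exact least squares fit with `np.linalg.lstsq`; the data, scales and variable names are assumptions for illustration. It also shows that a theta trained on normalised data can be mapped back to the original feature scale, since $\theta_j = \theta'_j / \sigma_j$ for each feature and $\theta_0 = \theta'_0 - \sum_j \theta'_j \mu_j / \sigma_j$ for the bias, where $\theta'$ is the theta of the normalised model.

```python
import numpy as np

rng = np.random.default_rng(3)
scales, offsets = np.array([1e3, 1e-2]), np.array([5.0, 0.3])  # hypothetical

# Training features on very different scales, and a noise-free target
features = rng.uniform(-1, 1, size=(200, 2)) * scales + offsets
theta_true = np.array([4.0, 0.002, 150.0])
y = np.hstack([np.ones((200, 1)), features]) @ theta_true

# Normalise with the training mean and standard deviation
mu, std = features.mean(axis=0), features.std(axis=0)
X_norm = np.hstack([np.ones((200, 1)), (features - mu) / std])

# Exact least squares fit on the normalised data (stand-in for gradient descent)
theta_norm, *_ = np.linalg.lstsq(X_norm, y, rcond=None)

# New examples: normalise them with the TRAINING mu and std, not their own
new_features = rng.uniform(-1, 1, size=(5, 2)) * scales + offsets
X_new_raw = np.hstack([np.ones((5, 1)), new_features])
X_new_norm = np.hstack([np.ones((5, 1)), (new_features - mu) / std])

y_expected = X_new_raw @ theta_true
y_wrong = X_new_raw @ theta_norm    # raw inputs with the normalised-space theta
y_right = X_new_norm @ theta_norm   # normalised inputs with the matching theta

# Map theta_norm back to the original feature scale
theta_back = np.empty_like(theta_norm)
theta_back[1:] = theta_norm[1:] / std
theta_back[0] = theta_norm[0] - np.sum(theta_norm[1:] * mu / std)

print('Max error, raw inputs:       ', np.abs(y_wrong - y_expected).max())
print('Max error, normalised inputs:', np.abs(y_right - y_expected).max())
print('Theta mapped back:', theta_back)
```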