# Logistic Regression: Regularisation
M2U5 - Exercise 5

## What are we going to do?
- We will implement the regularised cost and gradient descent functions
- We will check the training by plotting the evolution of the cost function
- We will find the optimal *lambda* regularisation parameter using validation

Remember to follow the instructions for the submission of assignments indicated in [Submission Instructions](https://github.com/Tokio-School/Machine-Learning-EN/blob/main/Submission_instructions.md).

## Instructions
Once the unregularised cost function and gradient descent are implemented, we will regularise them and train a full logistic regression model, checking it by validation and evaluating it on a test subset.

In [None]:
import time
import numpy as np

from matplotlib import pyplot as plt

## Create a synthetic dataset for logistic regression

We will create a synthetic dataset with only 2 classes (0 and 1) to test this implementation of a fully trained binary classification model, step by step.

To do this, manually create a synthetic dataset for logistic regression with bias and error term (to have *Theta_true* available) with the code you used in the previous exercise:

In [None]:
# TODO: Manually generate a synthetic dataset with a bias term and an error term
m = 100
n = 1

# Generate a 2D m x n array with random values between -1 and 1
# Insert a bias term as a first column of 1s
X = [...]

# Generate a theta array with n + 1 random values in [0, 1)
Theta_true = [...]

# Calculate Y as a function of X and *Theta_true*
# Transform Y to 1. where Y >= 0.0 and to 0. otherwise (float values)
# Using the error term as a probability, iterate over Y and flip the assigned class to its opposite, 1 to 0, and 0 to 1
error = 0.15

Y = [...]
Y = [...]
Y = [...]

# Check the values and dimensions of the vectors
print('Theta to be estimated and its dimensions:')
print()
print()

print('First 10 rows and 5 columns of X and Y:')
print()
print()

print('Dimensions of X and Y:')
print()
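
The generation steps described in the comments above could be sketched roughly as follows. This is only a sketch under stated assumptions, not the reference solution: the helper name `make_logistic_dataset`, the `_demo` variable names, and the use of `np.random.default_rng` with a fixed seed are choices made here for illustration.

```python
import numpy as np


def make_logistic_dataset(m=100, n=1, error=0.15, seed=42):
    """Hypothetical helper: build X (with bias column), theta_true and noisy 0/1 labels."""
    rng = np.random.default_rng(seed)

    # m x n features in [-1, 1), plus a first column of 1s as the bias term
    x = rng.uniform(-1.0, 1.0, size=(m, n))
    x = np.insert(x, 0, 1.0, axis=1)

    # n + 1 true coefficients in [0, 1)
    theta_true = rng.random(n + 1)

    # Class 1 where the linear combination is >= 0, class 0 otherwise
    y = (x @ theta_true >= 0.0).astype(float)

    # Flip each label to its opposite class with probability `error`
    flip = rng.random(m) < error
    y[flip] = 1.0 - y[flip]

    return x, y, theta_true


X_demo, Y_demo, Theta_demo = make_logistic_dataset()
print(X_demo.shape, Y_demo.shape, Theta_demo.shape)
```

The label flip is vectorised with a boolean mask here; iterating over `Y` element by element, as the comments suggest, gives the same result.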

## Implement the sigmoid activation function

Copy your cell with the sigmoid function:

In [None]:
# TODO: Implement the sigmoid function
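
As a reminder of what that cell needs to compute, a minimal sketch of the logistic function is shown below. The name `sigmoid_sketch` is hypothetical, chosen so that it does not clash with your own implementation.

```python
import numpy as np


def sigmoid_sketch(z):
    """Element-wise logistic function g(z) = 1 / (1 + exp(-z))."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))


print(sigmoid_sketch(0.0))
print(sigmoid_sketch(np.array([-2.0, 0.0, 2.0])))
```

For very large negative inputs `np.exp(-z)` may overflow and raise a warning, although the result still saturates towards 0.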

## Preprocess the data

As we did for linear regression, we will preprocess the data completely, following the usual 3 steps:

- Randomly reorder the data.
- Normalise the data.
- Divide the dataset into training, validation, and test subsets.

You can do this manually or with Scikit-learn's auxiliary functions.

### Randomly rearrange the dataset

Reorder the data in the *X* and *Y* dataset:

In [None]:
# TODO: Randomly reorder the dataset

print('First 10 rows and 5 columns of X and Y:')
print()
print()

print('Reorder X and Y:')
# Use an initial random state of 42, in order to maintain reproducibility
X, Y = [...]

print('First 10 rows and 5 columns of X and Y:')
print()
print()

print('Dimensions of X and Y:')
print()
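
One possible manual approach to the reordering is sketched below on toy data. It assumes a NumPy `Generator` seeded with 42; the variable names are illustrative only. The key point is that the same permutation is applied to both arrays, so that each example keeps its label.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy data: 5 examples, where row i of x_toy belongs to label y_toy[i]
x_toy = np.arange(10.0).reshape(5, 2)
y_toy = np.arange(5.0)

# One shared permutation keeps each row of X aligned with its label in Y
perm = rng.permutation(x_toy.shape[0])
x_shuffled, y_shuffled = x_toy[perm], y_toy[perm]

print(x_shuffled)
print(y_shuffled)
```

If you prefer Scikit-learn, `sklearn.utils.shuffle(X, Y, random_state=42)` reorders several arrays consistently in a single call.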

### Normalise the dataset

Implement the normalisation function and normalise the dataset of *X* examples:

In [None]:
# TODO: Normalise the dataset with a normalisation function

# Copy the normalisation function you used in the linear regression exercise
def normalize(x, mu, std):
    pass

# Find the mean and standard deviation of the features of X (columns), except the first column (bias)
mu = [...]
std = [...]

print('Original X:')
print(X)
print(X.shape)

print('Mean and standard deviation of the features:')
print(mu)
print(mu.shape)
print(std)
print(std.shape)

print('Normalized X:')
X_norm = np.copy(X)
X_norm[...] = normalize(X[...], mu, std)    # Normalize only column 1 and the subsequent columns, not column 0
print(X_norm)
print(X_norm.shape)

*Note*: If you modified your normalize function to calculate and return the values of mu and std, you can adapt this cell to use your custom code.
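
The following sketch illustrates the idea on a toy matrix, assuming standardisation (subtract the mean, divide by the standard deviation) is the normalisation used in the linear regression exercise. The names `normalize_sketch`, `x_toy`, `mu_toy` and `std_toy` are hypothetical.

```python
import numpy as np


def normalize_sketch(x, mu, std):
    """Standardise each column of x: subtract its mean and divide by its standard deviation."""
    return (x - mu) / std


# Toy matrix whose first column is the bias term
x_toy = np.array([[1.0, 2.0, 10.0],
                  [1.0, 4.0, 20.0],
                  [1.0, 6.0, 30.0]])

# Statistics of the feature columns only (column 0 is skipped)
mu_toy = x_toy[:, 1:].mean(axis=0)
std_toy = x_toy[:, 1:].std(axis=0)

x_norm_toy = np.copy(x_toy)
x_norm_toy[:, 1:] = normalize_sketch(x_toy[:, 1:], mu_toy, std_toy)
print(x_norm_toy)
```

The bias column is left out because its standard deviation is 0, so normalising it would divide by zero.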

### Divide the dataset into training, validation, and test subsets

Divide the *X* and *Y* dataset into 3 subsets with the usual ratio, 60%/20%/20%.

If your number of examples is much higher or lower, you can always modify this ratio to another ratio such as 50/25/25 or 80/10/10.

In [None]:
# TODO: Divide the X and Y dataset into the 3 subsets following the indicated ratios

ratio = [60, 20, 20]
print('Ratio:\n', ratio, ratio[0] + ratio[1] + ratio[2])

r = [0, 0]
# Tip: the round() function and the x.shape attribute may be useful to you
r[0] = [...]
r[1] = [...]
print('Cutoff indices:\n', r)

# Tip: the np.array_split() function may be useful to you
X_train, X_val, X_test = [...]
Y_train, Y_val, Y_test = [...]

print('Size of the subsets:')
print(X_train.shape)
print(Y_train.shape)
print(X_val.shape)
print(Y_val.shape)
print(X_test.shape)
print(Y_test.shape)
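
A possible way to compute the cutoff indices and split the arrays is sketched below on a toy dataset of 10 examples. The toy arrays and lower-case variable names are illustrative only.

```python
import numpy as np

# Toy dataset with 10 examples
x_toy = np.arange(20.0).reshape(10, 2)
y_toy = np.arange(10.0)

ratio = [60, 20, 20]
m = x_toy.shape[0]

# Cutoff indices: end of the training subset and end of the validation subset
r = [round(m * ratio[0] / 100), round(m * (ratio[0] + ratio[1]) / 100)]

# np.array_split cuts the array at the given row indices
x_train, x_val, x_test = np.array_split(x_toy, r)
y_train, y_val, y_test = np.array_split(y_toy, r)

print(r)
print(x_train.shape, x_val.shape, x_test.shape)
```

Because the same cutoff indices are used for both arrays, examples and labels stay aligned across the three subsets.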

## Implement the regularised cost function

We are going to implement the regularised cost function. This function will be similar to the one we implemented for linear regression in a previous exercise.

Regularised cost function:

$$ Y = h_\Theta(x) = g(X \times \Theta^T) $$
$$ J(\Theta) = - \left[ \frac{1}{m} \sum\limits_{i=0}^{m} \left( y^i \log(h_\theta(x^i)) + (1 - y^i) \log(1 - h_\theta(x^i)) \right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \Theta_j^2 $$

In [None]:
# TODO: Implement the regularised cost function for logistic regression

def regularized_logistic_cost_function(x, y, theta, lambda_=0.):
    """ Computes the cost function for the considered dataset and coefficients
    
    Positional arguments:
    x -- ndarray 2D with the values of the independent variables from the examples, of size m x n
    y -- ndarray 1D with the dependent/target variable, of size m x 1 and values of 0 or 1
    theta -- ndarray 1D with the weights of the model coefficients, of size 1 x n (row vector)
    lambda_ -- regularisation factor, by default 0.
    
    Return:
    j -- float with the cost for this theta array
    """
    m = [...]
    
    # Remember to check the dimensions of the matrix multiplication to perform it correctly
    j = [...]
    
    # Regularise for all Theta except the bias term (the first value)
    j += [...]
    
    return j
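
For comparison with your own solution, one way the formula above might translate to NumPy is sketched here. It is not the reference implementation: the name `cost_sketch`, the toy data, and the clipping of the predictions (added so that the logarithms stay finite) are assumptions of this sketch.

```python
import numpy as np


def cost_sketch(x, y, theta, lambda_=0.0):
    """Regularised cross-entropy cost; theta[0] (bias) is excluded from the penalty."""
    m = x.shape[0]
    h = 1.0 / (1.0 + np.exp(-(x @ theta)))

    # Clip predictions away from 0 and 1 so the logarithms stay finite (an added safeguard)
    eps = 1e-12
    h = np.clip(h, eps, 1.0 - eps)

    j = -np.mean(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))
    j += lambda_ / (2.0 * m) * np.sum(theta[1:] ** 2)
    return j


x_toy = np.array([[1.0, -1.0], [1.0, 0.5], [1.0, 2.0]])
y_toy = np.array([0.0, 1.0, 1.0])
theta_toy = np.array([0.0, 1.0])

for lambda_ in (0.0, 1.0, 2.0):
    print(lambda_, cost_sketch(x_toy, y_toy, theta_toy, lambda_))
```

Printing the cost for several `lambda_` values with `theta` held fixed is a quick way to see how the penalty term scales.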

Now let's check your implementation in the following scenarios:
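1. For *lambda* = 0:
    1. Using *Theta_true*, the cost should be greater than 0: the cross-entropy cost only approaches 0 when every example is classified correctly with a predicted probability close to 0 or 1, and the error term prevents this.
    1. As the value of *theta* moves away from *Theta_true*, the cost should generally increase, although the minimum does not have to sit exactly at *Theta_true* because of the error term.
1. For *lambda* != 0:
    1. Using *Theta_true*, the cost should be higher than with *lambda* = 0.
    1. The higher the *lambda*, the higher the cost.
    1. For a fixed *theta*, the increase in cost as a function of *lambda* is linear, since the penalty term is proportional to *lambda*.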

In [None]:
# TODO: Test your implementation on the dataset

theta = Theta_true    # Modify and test several values of theta

j = regularized_logistic_cost_function(X, Y, theta, lambda_=0.)    # Also test several values of lambda_

print('Cost of the model:')
print(j)
print('Checked theta and Actual theta:')
print(theta)
print(Theta_true)

Record your experiments and results in this cell (in Markdown or code):

1. Experiment 1
1. Experiment 2
1. Experiment 3
1. Experiment 4
1. Experiment 5

## Train an initial model on the training subset

As we did in previous exercises, we will train an initial model to check that our implementation and the dataset work correctly, and then we will be able to train a model with validation without any problem.

To do this, follow the same steps as you did for linear regression:
- Train an initial model without regularisation.
- Plot the history of the cost function to check its evolution.
- If necessary, modify any of the parameters and retrain the model. You will use these parameters in the following steps.

Copy the cells from previous exercises where you implemented gradient descent for unregularised logistic regression and where you trained the model, and modify them for regularised logistic regression.

Recall the gradient descent update rules for regularised logistic regression. The last two expressions are equivalent ways of writing the update for *theta_j*:

$$ Y = h_\Theta(x) = g(X \times \Theta^T) $$
$$ \theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=0}^{m}(h_\theta (x^i) - y^i) x_0^i $$
$$ \theta_j := \theta_j - \alpha [\frac{1}{m} \sum_{i=0}^{m}(h_\theta (x^i) - y^i) x_j^i + \frac{\lambda}{m} \theta_j]; \space j \in [1, n] $$
$$ \theta_j := \theta_j (1 - \alpha \frac{\lambda}{m}) - \alpha \frac{1}{m} \sum_{i=0}^{m}(h_\theta (x^i) - y^i) x_j^i; \space j \in [1, n] $$

In [None]:
# TODO: Copy the cell with the gradient descent for unregularised logistic regression and modify it to implement the regularisation
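
The update rules above could be vectorised roughly as in the sketch below. It assumes batch gradient descent with a fixed number of iterations; the function name `gradient_descent_sketch`, its signature, and the toy data are hypothetical and may differ from the cell you wrote in the previous exercise.

```python
import numpy as np


def gradient_descent_sketch(x, y, theta, alpha, lambda_=0.0, n_iter=1000):
    """Batch gradient descent for regularised logistic regression.

    Returns the final theta and the cost recorded at every iteration.
    """
    m = x.shape[0]
    theta = theta.astype(float).copy()
    j_hist = []

    for _ in range(n_iter):
        h = 1.0 / (1.0 + np.exp(-(x @ theta)))

        # Gradient of the unregularised cost for every coefficient
        grad = x.T @ (h - y) / m
        # Add the penalty gradient for every coefficient except the bias
        grad[1:] += lambda_ / m * theta[1:]

        theta -= alpha * grad

        # Record the regularised cost after the update
        h = np.clip(1.0 / (1.0 + np.exp(-(x @ theta))), 1e-12, 1.0 - 1e-12)
        j = -np.mean(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))
        j += lambda_ / (2.0 * m) * np.sum(theta[1:] ** 2)
        j_hist.append(j)

    return theta, j_hist


rng = np.random.default_rng(0)
x_toy = np.insert(rng.uniform(-1.0, 1.0, size=(50, 1)), 0, 1.0, axis=1)
y_toy = (x_toy[:, 1] >= 0.0).astype(float)

theta_fit, j_hist = gradient_descent_sketch(x_toy, y_toy, np.zeros(2), alpha=0.5, lambda_=1.0)
print(theta_fit, j_hist[0], j_hist[-1])
```

The returned `j_hist` list is what you would plot against the iteration number to check that the cost decreases.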

In [None]:
# TODO: Copy the cell where we trained the model
# Train your model without regularisation on the training subset and check that it works correctly

In [None]:
# TODO: Plot the evolution of the cost function vs. the number of iterations

plt.figure(1)

### Check the implementation

Check your implementation again, as you did in the previous exercise.

This time, also check that for a *lambda* other than 0, the higher the *lambda*, the higher the cost, due to the penalty.

### Check for bias or overfitting

As we did in linear regression, we will check for overfitting by comparing the cost of the model on the training and validation datasets:

In [None]:
# TODO: Check the cost of the model on the training and validation datasets
# Use the Theta_final of the trained model in both cases

Remember that with a random synthetic dataset it is unlikely that either problem will appear, but proceeding in this way lets us identify the following problems:

- If the final cost in both subsets is high, there may be a problem with underfitting or *bias*.
- If the final costs in both subsets are very different from each other, there may be a problem with overfitting or *variance*.

## Find the optimal *lambda* hyperparameter using validation

As we have done in previous exercises, we will optimise our regularisation parameter by validation.

To do this, we will train a different model on the training subset for each lambda value to be considered, and evaluate its error or final cost on the validation subset.

We will plot the error of each model vs. the *lambda* value used and write code that automatically chooses the best model among them.

Remember to train all your models under equal conditions:

In [None]:
# TODO: Train a model on X_train for each different lambda value and evaluate it on X_val

# Use a logarithmic space between 10 and 10^3 with 10 elements with non-zero decimal values starting with a 1 or a 3
lambdas = [...]

# Complete the code to train a different model for each value of lambda on X_train
# Store its theta and final cost/error
# Afterwards, evaluate its total cost on the validation subset

# Store this information in the following arrays, which have the same size as the lambdas array
j_train = [...]
j_val = [...]
theta_val = [...]
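
The validation loop described above might look like the following sketch. Everything in it is illustrative: the helper names (`sigmoid_sketch`, `cost_sketch`, `fit_sketch`), the toy data, and the short list of *lambda* values are assumptions, not the values requested in the cell. The sketch also evaluates the validation cost without the penalty term, which is a design choice made here so that models are compared on prediction error alone.

```python
import numpy as np


def sigmoid_sketch(z):
    return 1.0 / (1.0 + np.exp(-z))


def cost_sketch(x, y, theta, lambda_=0.0):
    h = np.clip(sigmoid_sketch(x @ theta), 1e-12, 1.0 - 1e-12)
    j = -np.mean(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))
    return j + lambda_ / (2.0 * x.shape[0]) * np.sum(theta[1:] ** 2)


def fit_sketch(x, y, lambda_, alpha=0.5, n_iter=500):
    theta = np.zeros(x.shape[1])
    for _ in range(n_iter):
        grad = x.T @ (sigmoid_sketch(x @ theta) - y) / x.shape[0]
        grad[1:] += lambda_ / x.shape[0] * theta[1:]
        theta -= alpha * grad
    return theta


# Toy training and validation subsets (stand-ins for X_train, Y_train, X_val, Y_val)
rng = np.random.default_rng(1)
x_all = np.insert(rng.normal(size=(80, 1)), 0, 1.0, axis=1)
y_all = (x_all @ np.array([0.2, 0.8]) >= 0.0).astype(float)
x_tr, y_tr, x_va, y_va = x_all[:60], y_all[:60], x_all[60:], y_all[60:]

lambdas_demo = [0.0, 0.1, 1.0, 10.0]  # illustrative values only
j_train_demo, j_val_demo, theta_demo = [], [], []

for lambda_ in lambdas_demo:
    # Same initial theta, alpha and iteration count for every model
    theta = fit_sketch(x_tr, y_tr, lambda_)
    theta_demo.append(theta)
    j_train_demo.append(cost_sketch(x_tr, y_tr, theta, lambda_))
    # Validation cost is measured without the penalty term (a choice of this sketch)
    j_val_demo.append(cost_sketch(x_va, y_va, theta))

print(j_train_demo)
print(j_val_demo)
```

Keeping the initial *theta*, learning rate and number of iterations identical across the loop is what makes the resulting costs comparable.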

In [None]:
# TODO: Plot the final error for each value of lambda

plt.figure(2)

# Fill in your code

### Choosing the best model

Copy the code from previous exercises and modify it to choose the model with the lowest cost on the validation subset:

In [None]:
# TODO: Choose the optimal model and lambda value, with the lowest error on the validation subset

# Iterate over all the combinations of theta and lambda and choose the one with the lowest cost on the validation subset

j_final = [...]
theta_final = [...]
lambda_final = [...]
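
Instead of an explicit loop, the selection can also be sketched with `np.argmin`. The arrays below contain made-up numbers purely to illustrate the indexing; they are not results from this exercise.

```python
import numpy as np

# Made-up values for illustration only: one entry per candidate lambda
lambdas_demo = np.array([0.0, 0.1, 1.0, 10.0])
j_val_demo = np.array([0.42, 0.38, 0.35, 0.51])
theta_demo = np.array([[0.3, 2.1], [0.3, 1.8], [0.2, 1.2], [0.1, 0.4]])

# Index of the model with the lowest validation cost
best = int(np.argmin(j_val_demo))

j_best = j_val_demo[best]
theta_best = theta_demo[best]
lambda_best = lambdas_demo[best]

print(best, lambda_best, j_best, theta_best)
```

Using a single index for the three arrays relies on them being stored in the same order as the *lambda* values.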

## Evaluate the model on the test subset

Finally, we will evaluate the model on a subset of data that we have used neither to train it nor to choose any of its hyperparameters.

Therefore, we will calculate the total error or cost on the test subset and graphically check the residuals of the model on it:

In [None]:
# TODO: Calculate the model error on the test subset using the cost function with the corresponding theta and lambda

j_test = [...]

In [None]:
# TODO: Calculate the predictions of the model on the test subset, calculate the residuals and plot them against the index of examples (m)

# Remember to use the sigmoid function to transform the predictions
Y_test_pred = [...]

residuals = [...]

plt.figure(3)

# Fill in your code

plt.show()
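
The prediction and residual steps could be sketched as follows, leaving the plot aside. The arrays are made-up stand-ins for `X_test`, `Y_test` and `theta_final`, and the sketch assumes the residuals are taken between the true labels and the predicted probabilities, which is one possible reading of the instructions.

```python
import numpy as np

# Stand-ins for X_test, Y_test and theta_final (made-up values for illustration)
x_te = np.array([[1.0, -1.5], [1.0, -0.2], [1.0, 0.4], [1.0, 1.8]])
y_te = np.array([0.0, 1.0, 1.0, 1.0])
theta_fin = np.array([0.1, 1.5])

# Predicted probability of class 1 for each test example
y_prob = 1.0 / (1.0 + np.exp(-(x_te @ theta_fin)))

# Residuals between the true labels and the predicted probabilities
residuals_demo = y_te - y_prob

print(y_prob)
print(residuals_demo)
```

With this definition every residual lies strictly between -1 and 1, and values close to 0 indicate confident, correct predictions.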

## Make predictions about new examples

With our model trained, optimised, and evaluated, all that remains is to put it to work by making predictions with new examples.

To do this, we will:
- Generate a new example, which follows the same pattern as the original dataset.
- Normalise its features before making predictions about them.
- Generate a prediction for this new example.

In [None]:
# TODO: Generate a new example following the original pattern, with a bias term and a random error term.

X_pred = [...]

# Normalise its features (except the bias term) with the original means and standard deviations
X_pred = [...]

# Generate a prediction for this new example
Y_pred = [...]
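
The three steps above can be sketched as follows for a single new example. The mean, standard deviation and coefficients are made-up stand-ins for the `mu`, `std` and `theta_final` computed earlier in the notebook, and the 0.5 threshold for the class is an assumption of this sketch.

```python
import numpy as np

# Stand-ins for the values computed earlier in the notebook (made-up for illustration)
mu_demo = np.array([0.05])
std_demo = np.array([0.6])
theta_fin = np.array([0.1, 1.5])

# New raw example with one feature, with the bias term inserted as the first column
x_new = np.array([[0.65]])
x_new = np.insert(x_new, 0, 1.0, axis=1)

# Normalise the features (not the bias) with the original mean and standard deviation
x_new[:, 1:] = (x_new[:, 1:] - mu_demo) / std_demo

# Predicted probability and class (0.5 threshold)
y_new_prob = 1.0 / (1.0 + np.exp(-(x_new @ theta_fin)))
y_new_class = (y_new_prob >= 0.5).astype(float)

print(x_new, y_new_prob, y_new_class)
```

Reusing the original statistics, rather than recomputing them on the new example, keeps the new features on the same scale the model was trained on.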