# Logistic Regression: Regularisation
M2U5 - Exercise 5

## What are we going to do?
- We will implement the regularised cost and gradient descent functions
- We will check the training by plotting the evolution of the cost function
- We will find the optimal *lambda* regularisation parameter using validation

Remember to follow the instructions for the submission of assignments indicated in [Submission Instructions](https://github.com/Tokio-School/Machine-Learning-EN/blob/main/Submission_instructions.md).

## Instructions
Once the unregularised cost function and gradient descent are implemented, we will regularise them and train a full logistic regression model, checking it by validation and evaluating it on a test subset.

In [None]:
import time
import numpy as np

from matplotlib import pyplot as plt

## Create a synthetic dataset for logistic regression

We will create a synthetic dataset with only 2 classes (0 and 1) to test this implementation of a fully trained binary classification model, step by step.

To do this, manually create a synthetic dataset for logistic regression with bias and error term (to have *Theta_true* available) with the code you used in the previous exercise:

In [None]:
# TODO: Manually generate a synthetic dataset with a bias term and an error term
m = 100
n = 1

# Generate a 2D m x n array with random values between -1 and 1
# Insert a bias term as a first column of 1s
X = [...]

# Generate a theta array with n + 1 random values between [0, 1)
Theta_true = [...]

# Calculate Y as a function of X and *Theta_true*
# Transform Y to float class labels: 1. when Y ≥ 0.0, and 0. otherwise
# Using a probability as the error term, iterate over Y and change the assigned class to its opposite, 1 to 0, and 0 to 1
error = 0.15

Y = [...]
Y = [...]
Y = [...]

# Check the values and dimensions of the vectors
print('Theta and its dimensions to be estimated:')
print()
print()

print('First 10 rows and 5 columns of X and Y:')
print()
print()

print('Dimensions of X and Y:')
print()
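
As a point of comparison for the cell above, here is a minimal sketch of one way to build such a dataset. It assumes that the class is obtained by thresholding the linear score `X @ Theta_true` at 0, and it uses a NumPy `Generator` called `rng` with a seed of 42; both choices are illustrative, not necessarily the ones from the previous exercise.

```python
import numpy as np

rng = np.random.default_rng(42)  # assumed seed, used only for reproducibility

m, n = 100, 1
error = 0.15

# m x n features in [-1, 1), plus a leading bias column of 1s
X = np.insert(rng.uniform(-1.0, 1.0, size=(m, n)), 0, 1.0, axis=1)

# n + 1 true coefficients in [0, 1)
Theta_true = rng.random(n + 1)

# Class 1.0 where the linear score is >= 0, class 0.0 otherwise (assumption)
Y = (X @ Theta_true >= 0.0).astype(float)

# Flip each label to its opposite class with probability `error`
flip = rng.random(m) < error
Y[flip] = 1.0 - Y[flip]

print(X.shape, Y.shape, Theta_true.shape)
```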

## Implement the sigmoid activation function

Copy your cell with the sigmoid function:

In [None]:
# TODO: Implement the sigmoid function
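
For reference, the logistic function g(z) = 1 / (1 + e^(-z)) can be sketched in a vectorised form as follows. The name `sigmoid` is only a suggestion; use whatever name your previous cell had.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# The output lies strictly between 0 and 1, and g(0) = 0.5
print(sigmoid(np.array([-2.0, 0.0, 2.0])))
```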

## Preprocess the data

As we did for linear regression, we will preprocess the data completely, following the usual 3 steps:

- Randomly reorder the data.
- Normalise the data.
- Divide the dataset into training, validation, and test subsets.

You can do this manually or with Scikit-learn's auxiliary functions.

### Randomly rearrange the dataset

Reorder the data in the *X* and *Y* dataset:

In [None]:
# TODO: Randomly reorder the dataset

print('First 10 rows and 5 columns of X and Y:')
print()
print()

print('Reorder X and Y:')
# Use an initial random state of 42, in order to maintain reproducibility
X, Y = [...]

print('First 10 rows and 5 columns of X and Y:')
print()
print()

print('Dimensions of X and Y:')
print()
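
A minimal sketch of a manual reordering, assuming a small stand-in dataset with made-up values. The key point is that a single shared permutation is applied to both arrays, so each row of *X* stays paired with its label. Scikit-learn's `sklearn.utils.shuffle(X, Y, random_state=42)` is an alternative.

```python
import numpy as np

rng = np.random.default_rng(42)  # random state of 42, for reproducibility

# Small stand-in dataset (hypothetical values) with a bias column
X = np.column_stack([np.ones(6), np.arange(6, dtype=float)])
Y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

# One shared permutation of the row indices for both X and Y
idx = rng.permutation(X.shape[0])
X, Y = X[idx], Y[idx]

print(X)
print(Y)
```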

### Normalise the dataset

Implement the normalisation function and normalise the dataset of *X* examples:

In [None]:
# TODO: Normalise the dataset with a normalisation function

# Copy the normalisation function you used in the linear regression exercise
def normalize(x, mu, std):
    pass

# Find the mean and standard deviation of the features of X (columns), except the first column (bias)
mu = [...]
std = [...]

print('Original X:')
print(X)
print(X.shape)

print('Mean and standard deviation of the features:')
print(mu)
print(mu.shape)
print(std)
print(std.shape)

print('Normalized X:')
X_norm = np.copy(X)
X_norm[...] = normalize(X[...], mu, std)    # Normalize only column 1 and the subsequent columns, not column 0
print(X_norm)
print(X_norm.shape)

*Note*: If you modified your normalize function to calculate and return the values of mu and std, you can adapt this cell to use your custom code.
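
As an illustration of the slicing involved, here is a sketch on a tiny made-up array, assuming standard-score normalisation, (x - mu) / std. Only column 1 onwards is normalised, so the bias column of 1s stays untouched.

```python
import numpy as np

def normalize(x, mu, std):
    """Standard-score normalisation, applied column by column."""
    return (x - mu) / std

# Hypothetical 3 x 3 dataset whose first column is the bias term
X = np.array([[1.0, 2.0, 10.0],
              [1.0, 4.0, 20.0],
              [1.0, 6.0, 30.0]])

# Statistics of the feature columns only (the bias column is skipped)
mu = X[:, 1:].mean(axis=0)
std = X[:, 1:].std(axis=0)

X_norm = np.copy(X)
X_norm[:, 1:] = normalize(X[:, 1:], mu, std)
print(X_norm)
```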

### Divide the dataset into training, validation, and test subsets

Divide the *X* and *Y* dataset into 3 subsets with the usual ratio, 60%/20%/20%.

If your number of examples is much higher or lower, you can always modify this ratio to another ratio such as 50/25/25 or 80/10/10.

In [None]:
# TODO: Divide the X and Y dataset into the 3 subsets following the indicated ratios

ratio = [60, 20, 20]
print('Ratio:\n', ratio, ratio[0] + ratio[1] + ratio[2])

r = [0, 0]
# Tip: the round() function and the x.shape attribute may be useful to you
r[0] = [...]
r[1] = [...]
print('Cutoff indices:\n', r)

# Tip: the np.array_split() function may be useful to you
X_train, X_val, X_test = [...]
Y_train, Y_val, Y_test = [...]

print('Size of the subsets:')
print(X_train.shape)
print(Y_train.shape)
print(X_val.shape)
print(Y_val.shape)
print(X_test.shape)
print(Y_test.shape)
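
A sketch of the split on a stand-in dataset, assuming the rows have already been shuffled. The cutoff indices are cumulative, which is the form `np.array_split()` expects when it is given a list of indices.

```python
import numpy as np

# Stand-in dataset (hypothetical values), assumed to be already shuffled
m = 100
X = np.column_stack([np.ones(m), np.linspace(-1.0, 1.0, m)])
Y = (X[:, 1] >= 0.0).astype(float)

ratio = [60, 20, 20]

# Cumulative cutoff indices: end of training, end of validation
r = [0, 0]
r[0] = round(m * ratio[0] / 100)
r[1] = r[0] + round(m * ratio[1] / 100)

X_train, X_val, X_test = np.array_split(X, r)
Y_train, Y_val, Y_test = np.array_split(Y, r)

print(X_train.shape, X_val.shape, X_test.shape)
```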

## Implement the regularised cost function

We are going to implement the regularised cost function. This function will be similar to the one we implemented for linear regression in a previous exercise.

Regularised cost function:

$$ Y = h_\Theta(x) = g(X \times \Theta^T) $$
$$ J(\Theta) = - \left[ \frac{1}{m} \sum\limits_{i=1}^{m} \left( y^i \log(h_\theta(x^i)) + (1 - y^i) \log(1 - h_\theta(x^i)) \right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \Theta_j^2 $$

In [None]:
# TODO: Implement the regularised cost function for logistic regression

def regularized_logistic_cost_function(x, y, theta, lambda_=0.):
    """ Computes the cost function for the considered dataset and coefficients
    
    Positional arguments:
    x -- ndarray 2D with the values of the independent variables from the examples, of size m x n
    y -- ndarray 1D with the dependent/target variable, of size m x 1 and values of 0 or 1
    theta -- ndarray 1D with the weights of the model coefficients, of size 1 x n (row vector)
    lambda_ -- regularisation factor, by default 0.
    
    Return:
    j -- float with the cost for this theta array
    """
    m = [...]
    
    # Remember to check the dimensions of the matrix multiplication to perform it correctly
    j = [...]
    
    # Regularise for all Theta except the bias term (the first value)
    j += [...]
    
    return j

Now let's check your implementation in the following scenarios:
1. For *lambda* = 0:
    1. Using *Theta_true*, the cost should be low but not exactly 0, because the sigmoid never outputs exactly 0 or 1 and some labels were flipped by the error term.
    1. As the value of *theta* moves away from *Theta_true*, the cost should tend to increase.
1. For *lambda* != 0:
    1. Using *Theta_true*, the cost should be greater than with *lambda* = 0.
    1. The higher the *lambda*, the higher the cost.
    1. For a fixed *theta*, the cost increases linearly with *lambda*, since the penalty term is proportional to it.
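
The dependence on *lambda* can be checked numerically. The sketch below uses a compact stand-in cost function (named `regularized_cost` here, a hypothetical name) and made-up data. For a fixed *theta*, equal steps in *lambda* produce equal steps in cost, because the penalty term is lambda / (2m) times the sum of the squared non-bias coefficients.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_cost(x, y, theta, lambda_=0.0):
    """Cross-entropy cost plus an L2 penalty on every theta except the bias."""
    m = x.shape[0]
    h = sigmoid(x @ theta)
    j = -np.mean(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))
    j += lambda_ / (2.0 * m) * np.sum(theta[1:] ** 2)
    return j

# Made-up dataset: 50 examples, a bias column and 1 feature
rng = np.random.default_rng(0)
X = np.insert(rng.uniform(-1.0, 1.0, size=(50, 1)), 0, 1.0, axis=1)
theta = np.array([0.3, 0.8])
Y = (X @ theta >= 0.0).astype(float)

costs = [regularized_cost(X, Y, theta, lam) for lam in (0.0, 10.0, 20.0, 30.0)]
print(costs)
print(np.diff(costs))  # equal steps: the penalty is linear in lambda
```

Note that the cost at *lambda* = 0 is already greater than 0, even though `Y` was generated from this same `theta` without any label noise.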

In [None]:
# TODO: Test your implementation on the dataset

theta = Theta_true    # Modify and test several values of theta

j = regularized_logistic_cost_function(X, Y, theta)    # Also test several values of lambda_

print('Cost of the model:')
print(j)
print('Checked theta and Actual theta:')
print(theta)
print(Theta_true)

Record your experiments and results in this cell (in Markdown or code):

1. Experiment 1
1. Experiment 2
1. Experiment 3
1. Experiment 4
1. Experiment 5

## Train an initial model on the training subset

As we did in previous exercises, we will train an initial model to check that our implementation and the dataset work correctly, and then we will be able to train a model with validation without any problem.

To do this, follow the same steps as you did for linear regression:
- Train an initial model without regularisation.
- Plot the history of the cost function to check its evolution.
- If necessary, modify any of the parameters and retrain the model. You will use these parameters in the following steps.

Copy the cells from previous exercises where you implemented gradient descent for unregularised logistic regression and the cell where you trained the model, and modify them for regularised logistic regression.

Recall the gradient descent functions for regularised logistic regression:

$$ Y = h_\Theta(x) = g(X \times \Theta^T) $$
$$ \theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m}(h_\theta (x^i) - y^i) x_0^i $$
$$ \theta_j := \theta_j - \alpha [\frac{1}{m} \sum_{i=1}^{m}(h_\theta (x^i) - y^i) x_j^i + \frac{\lambda}{m} \theta_j]; \space j \in [1, n] $$
$$ \theta_j := \theta_j (1 - \alpha \frac{\lambda}{m}) - \alpha \frac{1}{m} \sum_{i=1}^{m}(h_\theta (x^i) - y^i) x_j^i; \space j \in [1, n] $$
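
These update rules can be sketched in vectorised form as follows. This is only one possible layout, with hypothetical names (`regularized_gradient_descent`, `cost`, `n_iter`) and made-up data; a fixed number of iterations is assumed instead of a convergence criterion. The penalty gradient is added to every coefficient except the bias.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(x, y, theta, lambda_=0.0):
    h = sigmoid(x @ theta)
    j = -np.mean(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))
    return j + lambda_ / (2.0 * x.shape[0]) * np.sum(theta[1:] ** 2)

def regularized_gradient_descent(x, y, theta, alpha, lambda_=0.0, n_iter=1000):
    """Batch gradient descent; returns the final theta and the cost history."""
    m = x.shape[0]
    theta = np.array(theta, dtype=float)  # work on a copy
    j_hist = []
    for _ in range(n_iter):
        grad = x.T @ (sigmoid(x @ theta) - y) / m  # unregularised gradient
        grad[1:] += lambda_ / m * theta[1:]        # penalty, bias excluded
        theta -= alpha * grad
        j_hist.append(cost(x, y, theta, lambda_))
    return theta, j_hist

# Made-up dataset with 15% of the labels flipped
rng = np.random.default_rng(42)
X = np.insert(rng.uniform(-1.0, 1.0, size=(100, 1)), 0, 1.0, axis=1)
Y = (X @ np.array([0.2, 0.9]) >= 0.0).astype(float)
flip = rng.random(100) < 0.15
Y[flip] = 1.0 - Y[flip]

theta, j_hist = regularized_gradient_descent(
    X, Y, np.zeros(2), alpha=0.5, lambda_=1.0, n_iter=500)
print(theta, j_hist[0], j_hist[-1])
```

With a sufficiently small `alpha`, the cost history should decrease at every iteration, which is what the plot in the next cells is meant to confirm.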

In [None]:
# TODO: Copy the cell with the gradient descent for unregularised logistic regression and modify it to implement the regularisation

In [None]:
# TODO: Copy the cell where we trained the model
# Train your model without regularisation on the training subset and check that it works correctly

In [None]:
# TODO: Plot the evolution of the cost function vs. the number of iterations

plt.figure(1)

### Check the implementation

Check your implementation again, as you did in the previous exercise.

This time, also check that with a *lambda* other than 0 the penalty makes the cost higher the larger *lambda* is.

### Check for bias or overfitting

As we did for linear regression, we will check for overfitting by comparing the cost of the model on the training and validation datasets:

In [None]:
# TODO: Check the cost of the model on the training and validation datasets
# Use the Theta_final of the trained model in both cases

Remember that with a random synthetic dataset it is unlikely that either case will occur, but this is how we could detect such problems:

- If the final cost on both subsets is high, there may be a *bias* (underfitting) problem.
- If the final costs on the two subsets are very different from each other, there may be an overfitting or *variance* problem.

## Find the optimal *lambda* hyperparameter using validation

As we have done in previous exercises, we will optimise our regularisation parameter using validation.

To do this, we will train a different model on the training subset for each value of *lambda* to be considered, and evaluate its final error or cost on the validation subset.

We will plot the error of each model vs. the value of *lambda* used and implement code that automatically chooses the best model among them.

Remember to train all your models under equal conditions:

In [None]:
# TODO: Train a model for each different value of lambda on X_train and evaluate it on X_val

# Again, use a logarithmic space between 10 and 10^3 with 10 elements, with values whose first non-zero digit is 1 or 3
lambdas = [...]

# Complete the code to train a different model for each value of lambda on X_train
# Store its theta and final error/cost
# Then evaluate its total cost on the validation subset

# Store this information in the following ndarrays, of the same size as lambdas
j_train = [...]
j_val = [...]
theta_val = [...]
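
The loop described above can be sketched as follows. Everything here is illustrative: the helper names `cost` and `train`, the made-up dataset, the short grid of *lambda* values, and the choice of `alpha` and `n_iter`. Each model starts from the same initial theta and uses the same learning rate and number of iterations, so that all of them are trained under equal conditions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(x, y, theta, lambda_=0.0):
    h = sigmoid(x @ theta)
    j = -np.mean(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))
    return j + lambda_ / (2.0 * x.shape[0]) * np.sum(theta[1:] ** 2)

def train(x, y, alpha, lambda_, n_iter):
    """Regularised batch gradient descent starting from theta = 0."""
    m = x.shape[0]
    theta = np.zeros(x.shape[1])
    for _ in range(n_iter):
        grad = x.T @ (sigmoid(x @ theta) - y) / m
        grad[1:] += lambda_ / m * theta[1:]  # the bias is not penalised
        theta -= alpha * grad
    return theta

# Made-up dataset with 15% of the labels flipped, split 60/20 (test unused)
rng = np.random.default_rng(42)
X = np.insert(rng.uniform(-1.0, 1.0, size=(100, 1)), 0, 1.0, axis=1)
Y = (X @ np.array([0.2, 0.9]) >= 0.0).astype(float)
flip = rng.random(100) < 0.15
Y[flip] = 1.0 - Y[flip]
X_train, X_val = X[:60], X[60:80]
Y_train, Y_val = Y[:60], Y[60:80]

lambdas = np.array([10.0, 30.0, 100.0, 300.0, 1000.0])  # illustrative grid
j_train = np.zeros(lambdas.size)
j_val = np.zeros(lambdas.size)
theta_val = np.zeros((lambdas.size, X.shape[1]))

for i, lam in enumerate(lambdas):
    # alpha is kept small so that the largest lambda does not diverge
    theta_val[i] = train(X_train, Y_train, alpha=0.01, lambda_=lam, n_iter=2000)
    j_train[i] = cost(X_train, Y_train, theta_val[i], lam)
    j_val[i] = cost(X_val, Y_val, theta_val[i], lam)

print(j_train)
print(j_val)
```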

In [None]:
# TODO: Plot the final error for each value of lambda

plt.figure(2)

# Fill in with your code

### Choose the best model

Copy the code from previous exercises, modifying it if necessary, to choose the model that performs best (lowest cost) on the validation subset:

In [None]:
# TODO: Choose the optimal model and lambda value, with the lowest error on the CV subset

# Iterate over all the combinations of theta and lambda and choose the ones with the lowest cost on the CV subset

j_final = [...]
theta_final = [...]
lambda_final = [...]
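
Instead of an explicit loop, the selection can also be sketched with `np.argmin()`, which returns the index of the lowest validation cost. The arrays below hold hypothetical values that only illustrate the indexing; they are not results from this exercise.

```python
import numpy as np

# Hypothetical results of a lambda sweep (not real outputs)
lambdas = np.array([10.0, 30.0, 100.0])
j_val = np.array([0.52, 0.48, 0.61])
theta_val = np.array([[0.1, 0.9],
                      [0.1, 0.7],
                      [0.0, 0.3]])

best = np.argmin(j_val)  # index of the model with the lowest validation cost

j_final = j_val[best]
theta_final = theta_val[best]
lambda_final = lambdas[best]

print(best, j_final, theta_final, lambda_final)
```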

## Evaluate the model on the test subset

Finally, we will evaluate the model on a subset of data that we have not used to train it or to choose any hyperparameter.

To do this, we will calculate the total cost or error on the test subset and graphically check the residuals on it:

In [None]:
# TODO: Calculate the error of the model on the test subset using the cost function with the corresponding theta and lambda

j_test = [...]

In [None]:
# TODO: Calculate the predictions of the model on the test subset, calculate the residuals, and plot them against the index of the examples (m)

# Remember to use the sigmoid function to transform the predictions
Y_test_pred = [...]

residuos = [...]

plt.figure(3)

# Fill in with your code

plt.show()

## Make predictions on new examples

With our model trained, optimised, and evaluated, all that remains is to put it to work by making predictions on new examples.

To do this, we will:
- Generate a new example, following the same pattern as the original dataset.
- Normalise its features before making predictions on it.
- Generate a prediction for this new example.

In [None]:
# TODO: Generate a new example following the original pattern, with a bias term and a random error

X_pred = [...]

# Normalise its features (except the bias term) with the original means and standard deviations
X_pred = [...]

# Generate a prediction for this example
Y_pred = [...]
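
The three steps can be sketched as below. The values of `mu`, `std` and `theta_final` are hypothetical stand-ins for the results of the earlier cells, and the decision threshold of 0.5 is an assumption, the usual choice for turning a probability into a class.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical stand-ins for values computed in earlier cells
mu = np.array([0.05])
std = np.array([0.6])
theta_final = np.array([0.1, 1.2])

# New example: bias term followed by 1 feature (made-up value)
X_pred = np.array([[1.0, 0.65]])

# Normalise every column except the bias with the ORIGINAL mu and std
X_pred_norm = X_pred.copy()
X_pred_norm[:, 1:] = (X_pred[:, 1:] - mu) / std

# Probability of class 1, then the class itself with a 0.5 threshold
prob = sigmoid(X_pred_norm @ theta_final)
Y_pred = (prob >= 0.5).astype(float)

print(prob, Y_pred)
```

Reusing the training statistics, rather than recomputing them from the new example, keeps the new features on the same scale the model was trained on.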