# Logistic Regression: Training and predictions
M2U5 - Exercise 4

## What are we going to do?
- We will create a synthetic dataset for logistic regression
- We will preprocess the data
- We will train the model using gradient descent
- We will check the training by plotting the evolution of the cost function
- We will make predictions about new examples

Remember to follow the instructions for the submission of assignments indicated in [Submission Instructions](https://github.com/Tokio-School/Machine-Learning-EN/blob/main/Submission_instructions.md).

## Instructions
Now that the cost function is implemented, we will train a logistic regression model by gradient descent, check that the training behaves as expected, evaluate the model on a test subset, and finally use it to make predictions.

This time we will work with binary logistic regression; other exercises cover multiclass classification.

In [None]:
import time
import numpy as np
from matplotlib import pyplot as plt

## Create a synthetic dataset for logistic regression

We will create a synthetic dataset with only 2 classes (0 and 1) to test this implementation of a fully trained binary classification model, step by step.

To do this, manually create a synthetic dataset for logistic regression with a bias term and an error term (so that *Theta_true* is available), using the code from the previous exercise:
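As a rough guide, the generation steps could look like the sketch below. It is only one possible approach: the use of `np.random.default_rng`, the seed value and the helper name `flip` are assumptions made for illustration, not part of the original exercise.

```python
import numpy as np

rng = np.random.default_rng(42)  # seed chosen only to make this sketch reproducible

m, n = 100, 1
error = 0.15

# Features in [-1, 1), with a bias column of 1s inserted as the first column
X = rng.uniform(-1.0, 1.0, size=(m, n))
X = np.insert(X, 0, 1.0, axis=1)

# True coefficients in [0, 1), one per column of X (bias included)
Theta_true = rng.random(n + 1)

# Linear score, thresholded at 0.0 to obtain the classes 1.0 and 0.0
Y = X @ Theta_true
Y = (Y >= 0.0).astype(float)

# Error term: flip each label to the opposite class with probability `error`
flip = rng.random(m) < error  # hypothetical helper name
Y = np.where(flip, 1.0 - Y, Y)
```

Flipping labels with a boolean mask avoids an explicit Python loop over *Y*, but an element-by-element loop as described in the cell comments is equally valid.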

In [None]:
# TODO: Manually generate a synthetic dataset with a bias term and an error term
m = 100
n = 1

# Generate a 2D m x n array with random values between -1 and 1
# Insert a bias term as a first column of 1s
X = [...]

# Generate a theta array with n + 1 random values between [0, 1)
Theta_true = [...]

# Calculate Y as a function of X and Theta_true
# Transform Y to values of 1 and 0 (float) when Y ≥ 0.0
# Using a probability as the error term, iterate over Y and change the assigned class to its opposite, 1 to 0, and 0 to 1
error = 0.15

Y = [...]
Y = [...]
Y = [...]

# Check the values and dimensions of the vectors
print('Theta and its dimensions to be estimated:')
print()
print()

print('First 10 rows and 5 columns of X and Y:')
print()
print()

print('Dimensions of X and Y:')
print()

## Implement the sigmoid activation function

Copy your cell with the sigmoid function:
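For reference, a minimal sketch of the logistic (sigmoid) function, g(z) = 1 / (1 + e^(-z)), is shown here. Your own cell from the previous exercise may differ in naming or structure.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function 1 / (1 + e^(-z))."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))
```

It accepts scalars or arrays, so it can be applied directly to the product of *X* and *theta*.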

In [None]:
# TODO: Implement the sigmoid function

## Preprocess the data

As we did for linear regression, we will preprocess the data completely, following the usual 3 steps:

- Randomly reorder the data.
- Normalise the data.
- Divide the dataset into training and test subsets.

You can do this manually or with Scikit-learn's auxiliary functions.


### Randomly reorder the dataset

Reorder the data in the *X* and *Y* dataset:
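One way to do this manually is to draw a single random permutation of the row indices and apply it to both arrays, so that every example stays paired with its label. The toy arrays and the names `perm`, `X_shuffled` and `Y_shuffled` below are illustrative assumptions.

```python
import numpy as np

# Toy data standing in for the notebook's X and Y
X = np.arange(10, dtype=float).reshape(5, 2)
Y = np.arange(5, dtype=float)

rng = np.random.default_rng(42)  # fixed seed for reproducibility

# A single shared permutation keeps the rows of X and Y aligned
perm = rng.permutation(X.shape[0])
X_shuffled, Y_shuffled = X[perm], Y[perm]
```

Scikit-learn's `sklearn.utils.shuffle(X, Y, random_state=42)` is an alternative that performs the same paired reordering.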

In [None]:
# TODO: Randomly reorder the dataset

print('First 10 rows and 5 columns of X and Y:')
print()
print()

print('Reorder X and Y:')
# Use an initial random state of 42, in order to maintain reproducibility
X, Y = [...]

print('First 10 rows and 5 columns of X and Y:')
print()
print()

print('Dimensions of X and Y:')
print()

### Normalise the dataset

Implement the normalisation function and normalise the dataset of *X* examples:
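A possible sketch of this step, using standardisation (subtracting the mean and dividing by the standard deviation), follows. The toy *X* is an assumption for illustration, and the sketch assumes that no feature has a standard deviation of zero.

```python
import numpy as np

def normalize(x, mu, std):
    """Standardise x column-wise: zero mean and unit standard deviation."""
    return (x - mu) / std

# Toy X with a bias column of 1s followed by a single feature
X = np.array([[1.0, -0.5], [1.0, 0.0], [1.0, 0.25], [1.0, 0.9]])

# Statistics of every column except the bias column
mu = X[:, 1:].mean(axis=0)
std = X[:, 1:].std(axis=0)

X_norm = np.copy(X)
X_norm[:, 1:] = normalize(X[:, 1:], mu, std)  # the bias column is left untouched
```

Keep `mu` and `std`: the same values are needed later to normalise new examples before predicting.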

In [None]:
# TODO: Normalise the dataset with a normalisation function

# Copy the normalisation function you used in the linear regression exercise
def normalize(x, mu, std):
    pass

# Find the mean and standard deviation of the X features (columns), except for the first one (bias)
mu = [...]
std = [...]

print('Original X:')
print(X)
print(X.shape)

print('Mean and standard deviation of the features:')
print(mu)
print(mu.shape)
print(std)
print(std.shape)

print('Normalised X:')
X_norm = np.copy(X)
X_norm[...] = normalize(X[...], mu, std)    # Normalise only column 1 and the subsequent columns, not column 0
print(X_norm)
print(X_norm.shape)

### Divide the dataset into training and test subsets

Divide the *X* and *Y* dataset into 2 subsets with the usual ratio of 70%/30%.

If your number of examples is much higher or lower, you can always modify this ratio accordingly.
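The split can be sketched with `round()` and `np.array_split()`, as the tips in the cell suggest. The placeholder arrays below only stand in for the shuffled, normalised dataset.

```python
import numpy as np

# Placeholder arrays standing in for the shuffled, normalised dataset
X = np.ones((100, 2))
Y = np.zeros(100)

ratio = [70, 30]

# Index of the first test example
r = round(X.shape[0] * ratio[0] / 100)

# Passing a list of indices to array_split cuts the array at those positions
X_train, X_test = np.array_split(X, [r])
Y_train, Y_test = np.array_split(Y, [r])
```

Because the dataset was shuffled beforehand, a simple cut at index `r` yields two random subsets.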

In [None]:
# TODO: Divide the X and Y dataset into the 2 subsets according to the indicated ratio

ratio = [70, 30]
print('Ratio:\n', ratio, ratio[0] + ratio[1])

# Cutoff index
# Tip: the round() function and the x.shape attribute may be useful to you
r = [...]
print('Cutoff indices:\n', r)

# Tip: the np.array_split() function may be useful to you
X_train, X_test = [...]
Y_train, Y_test = [...]

print('Size of the subsets:')
print(X_train.shape)
print(Y_train.shape)
print(X_test.shape)
print(Y_test.shape)

## Train an initial model on the training subset

As we did in previous exercises, we will train an initial model to check that our implementation and the dataset work correctly, and then we will be able to train a model with validation without any problem.

To do this, follow the same steps as you did for linear regression:
- Train an initial model without implementing regularisation.
- Plot the history of the cost function to check its evolution.
- If necessary, modify any of the parameters and retrain the model. You will use these parameters in the following steps.

Copy the cells from previous exercises where you implemented the cost function for logistic regression, the unregularised gradient descent for linear regression, and the cell where you trained the regression model, and modify them for logistic regression.

Recall the hypothesis and the gradient descent update for logistic regression, where *g* is the sigmoid function:

$$ Y = h_\Theta(X) = g(X \times \Theta^T) $$
$$ \theta_j := \theta_j - \alpha \left[ \frac{1}{m} \sum_{i=0}^{m-1} (h_\theta(x^i) - y^i) x_j^i \right] $$
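The update rule can be sketched in vectorised form as follows. This is an illustrative version rather than the reference solution: it assumes the cost is the usual cross-entropy (log loss), and the function names, the stopping tolerance `e` and the demo data are all assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_function(x, y, theta):
    """Cross-entropy cost, assumed here as the logistic regression cost."""
    h = np.clip(sigmoid(x @ theta), 1e-12, 1.0 - 1e-12)  # avoid log(0)
    return -np.mean(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))

def gradient_descent(x, y, theta, alpha, e=1e-6, iter_=1000):
    """Unregularised batch gradient descent; returns (cost history, final theta)."""
    m = x.shape[0]
    theta = theta.copy()
    j_hist = []
    for k in range(iter_):
        h = sigmoid(x @ theta)
        theta = theta - alpha * (x.T @ (h - y)) / m  # update all theta_j at once
        j_hist.append(cost_function(x, y, theta))
        # Stop when the cost changes by less than the tolerance e
        if k > 0 and abs(j_hist[-2] - j_hist[-1]) < e:
            break
    return j_hist, theta

# Tiny illustrative run on made-up data (not the notebook's dataset)
rng = np.random.default_rng(42)
X_demo = np.insert(rng.uniform(-1, 1, size=(50, 1)), 0, 1.0, axis=1)
Y_demo = (X_demo @ np.array([0.3, 0.8]) >= 0.0).astype(float)
j_hist, theta_final = gradient_descent(X_demo, Y_demo, np.zeros(2), alpha=0.5)
```

The only change with respect to linear regression is that the hypothesis passes the linear combination through the sigmoid function; the form of the update is otherwise the same.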

In [None]:
# TODO: Copy the cell with the cost function

In [None]:
# TODO: Copy the cell with the unregularised gradient descent function for linear regression and adapt it for logistic regression

In [None]:
# TODO: Copy the cell where we trained the model
# Train your unregularised model on the training subset

In [None]:
# TODO: Plot the evolution of the cost function vs. the number of iterations

plt.figure(1)

Check your implementation in the following scenarios:
1. Starting from *Theta_true* as the initial *theta*, the final cost should be practically 0 and training should converge in a couple of iterations.
1. As the initial value of *theta* moves away from *Theta_true*, training should need more iterations to converge, and *theta_final* should still be very similar to *Theta_true*.

To do this, remember that you can modify the values of the cells and re-execute them.

Record your experiments and results in this cell (in Markdown or code):
1. Experiment 1
1. Experiment 2

## Evaluate the model on the test subset

Finally, we will evaluate the model on a subset of data that we have not used to train it.

To do so, we will calculate the total cost or error on the test subset and plot the residuals of the model on it:
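A minimal sketch of these calculations is given below. The values of `X_test`, `Y_test` and `theta` are hypothetical stand-ins for the results of the earlier cells, and the cost is again assumed to be the cross-entropy.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical stand-ins for the test subset and the trained theta
X_test = np.array([[1.0, -2.0], [1.0, 0.0], [1.0, 2.0]])
Y_test = np.array([0.0, 1.0, 1.0])
theta = np.array([0.0, 1.5])

# Predicted probability of class 1 for each test example
Y_test_pred = sigmoid(X_test @ theta)

# Residuals: predicted probability minus the true label
residuals = Y_test_pred - Y_test

# Cross-entropy cost on the test subset
j_test = -np.mean(
    Y_test * np.log(Y_test_pred) + (1.0 - Y_test) * np.log(1.0 - Y_test_pred)
)
```

The residuals can then be plotted against the example index, for instance with `plt.scatter(np.arange(residuals.size), residuals)`. Values close to 0 indicate confident, correct predictions.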

In [None]:
# TODO: Calculate the error of the model on the test subset using the cost function with the corresponding theta

j_test = [...]

In [None]:
# TODO: Calculate the predictions of the model on the test subset, calculate the residuals and plot them against the index of examples (m)

# Remember to use the sigmoid function to transform the predictions
Y_test_pred = [...]

residuals = [...]

plt.figure(3)

# Fill in your code

plt.show()

## Make predictions about new examples

With our model trained, optimised, and evaluated, all that remains is to put it to work by making predictions with new examples.

To do this, we will:
- Generate a new example, following the same pattern as the original dataset.
- Normalise its features before making predictions about them.
- Generate a prediction for this new example.
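These three steps might look like the following sketch. The values of `mu`, `std` and `theta` are hypothetical placeholders for those obtained earlier in the notebook, and the 0.5 decision threshold is an assumption.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical placeholders for values computed in earlier cells
mu = np.array([0.1])
std = np.array([0.6])
theta = np.array([0.2, 1.3])

rng = np.random.default_rng(0)

# New example: one feature in [-1, 1), preceded by the bias term
X_pred = rng.uniform(-1.0, 1.0, size=(1, 1))
X_pred = np.insert(X_pred, 0, 1.0, axis=1)

# Normalise with the ORIGINAL training statistics, skipping the bias column
X_pred[:, 1:] = (X_pred[:, 1:] - mu) / std

# Predicted probability of class 1 and the resulting class label
prob = sigmoid(X_pred @ theta)
Y_pred = (prob >= 0.5).astype(float)
```

Recomputing the mean and standard deviation from the new example alone would not work: the model was trained on features scaled with the original statistics.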

In [None]:
# TODO: Generate a new example following the original pattern, with a bias term and a random error term

X_pred = [...]

# Normalise its features (except the bias term) with the original means and standard deviations
X_pred = [...]

# Generate a prediction for this new example
Y_pred = [...]