# Building an Artificial Neural Network
In this notebook, we will write our own code for ANNs and implement the backpropagation algorithm.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm

%matplotlib inline

## Dataset

We will use `sklearn`'s `make_circles` to create a dataset with 2-D points forming two concentric circles. `X` stores the 2-D coordinates of all the points, and `y` is the vector indicating which circle each point belongs to.

In [None]:
from matplotlib.pyplot import cm
from sklearn import datasets

n = 500
p = 2

X, y = datasets.make_circles(n_samples=n, shuffle=True, factor=0.6, noise = 0.05, random_state=13)
y = y[:, np.newaxis] # convert y into a column vector
c = 2
#X, y = datasets.make_moons(n_samples=500, shuffle=True, noise=0.05, random_state=13)
#c = 2
#c = 4
#X, y = datasets.make_blobs(n_samples=500, shuffle=True, centers=c, random_state=13)

print("Dimensions of X:",X.shape)
print("Dimensions of y:",y.shape)

colors = iter(cm.rainbow(np.linspace(0, 1, c)))

for yp in np.arange(c):
    yp_points = y[:,0] == yp
    plt.scatter(X[yp_points, 0],X[yp_points, 1], c=np.array([next(colors)]) )


plt.show()

## Neural network's layer (class `NNLayer`)

We can create a class to store the elements of each layer:
- a matrix of weights `W` of size ($M\times N$), where $M$ is the number of neurons in the previous layer (inputs to each neuron in the current layer) and $N$ is the number of neurons in the current layer.
- a vector of bias terms `b` of size ($1\times N$)
- the activation function (and its derivative)

In [None]:
class NNLayer:
    '''
    Create random weights and bias terms in the interval [-1,1].
    m: number of neurons in the previous layer (inputs to each neuron in the current layer)
    n: number of neurons in the current layer
    act_f: activation function (act_f[0]) and its derivative (act_f[1])
    '''
    def __init__(self, m, n, act_f,rnd_gen=np.random.default_rng()):
        self.b = rnd_gen.random((1, n))*2-1
        self.W = rnd_gen.random((m, n))*2-1
        self.act_f = act_f

## Neural network

We can create a neural network as a sequence of layers. We only need to know the number of layers, the number of neurons in each one, and the corresponding activation function.

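Conceptually, the topology of the network is a list of layers (0 for the initial or input layer, 1..n-2 for the hidden layers, and n-1 for the final or output layer), where each layer is described by:

1. The number of neurons in the layer

2. The activation function of the layer

In the class below, this information is passed as `in_dim` (the size of the input layer) plus the parallel lists `n_neurons` and `act_fs`, and the result is a list of `NNLayer` objects.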



In [None]:
class ANN:
    '''
    Create a NN
    in_dim: input dimension
    n_neurons: vector with the number of neurons in each layer
    act_fs: activation function for each layer
    '''
    def __init__(self, in_dim, n_neurons, act_fs, rnd_gen=np.random.default_rng()):
        self.layers = []
        self.layers.append(NNLayer(in_dim,n_neurons[0],act_fs[0], rnd_gen))
        for l in range(1,len(n_neurons)):
            self.layers.append(NNLayer(n_neurons[l-1], n_neurons[l] , act_fs[l], rnd_gen))
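As a quick sanity check on the layer sizes, the sketch below lists the weight and bias shapes that such a sequence of layers would have. The helper `layer_shapes` is a hypothetical name introduced here for illustration only; it mirrors the shape bookkeeping of the constructor above without creating any layer.

```python
def layer_shapes(in_dim, n_neurons):
    """Return one (W_shape, b_shape) pair per layer (hypothetical helper)."""
    shapes = []
    prev = in_dim
    for n in n_neurons:
        # W maps `prev` inputs to `n` neurons; b holds one bias per neuron
        shapes.append(((prev, n), (1, n)))
        prev = n
    return shapes

shapes = layer_shapes(2, [3, 3, 1])
for idx, (w_shape, b_shape) in enumerate(shapes):
    print(f"layer {idx}: W{w_shape}, b{b_shape}")

# total number of trainable parameters (weights plus biases)
n_params = sum(w[0] * w[1] + b[1] for w, b in shapes)
```

Each weight matrix has as many rows as the previous layer has outputs, which is why consecutive layers can be chained with a matrix product.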

### Forward Propagation
 
Forward propagation is nothing more than running the ANN with input data.

In [None]:
'''
Performs forward propagation
X: input data (n,m): where n is the number of cases, and m the input dimension
store_all: stores and returns intermediate results
'''
def forward_propagation(self, X, store_all=True):
    # We might want to store all the intermediate results to reuse later (backpropagation)
    self.intermediate_results = []
    if store_all:
        self.intermediate_results.append((None,X))

    prev_out = X
    for l in range(len(self.layers)):
        # We first calculate the linear combination
        # @ operator is the matrix multiplication
        # the final z matrix will be size (n_cases x n_neurons)
        z = prev_out @ self.layers[l].W + self.layers[l].b

        # We apply the activation function
        # the final matrix, a, will be size (n_cases x n_neurons)
        a = self.layers[l].act_f[0](z)

        prev_out = a
        if store_all:
            self.intermediate_results.append((z,a))

    return prev_out

# add method to the ANN class
ANN.forward_propagation = forward_propagation

Now we have defined the basic operation of neural networks: the configuration and input processing.

Let's assume we need an ANN with two hidden layers and an output layer, where both hidden layers have 3 neurons each and the output just one neuron, such as:

<img src="images/ann_graph.png" alt="ann graph" style="width: 600px;"/>

Each neuron linearly combines the inputs and applies a non-linear transformation $\sigma$ to it.

$$\phi_j^{(l)}(\textbf{a}^{(l-1)})=\sigma\left( \sum_{k=1}^{N^{(l-1)}} w^{(l)}_{kj}\cdot a^{(l-1)}_{k}+ b^{(l)}_j\right)$$

for the $j$-th neuron of layer $l$, where $\textbf{a}^{(l-1)}$ is the vector of outputs of the previous layer (length $N^{(l-1)}$). The input vector $\textbf{a}^{(l-1)}$ is $(\phi_j^{(l-1)})_{j=1}^{N^{(l-1)}}$ if $l-1>1$, or just the input data $\textbf{x}$ if $l-1=1$.
The function $\sigma$ is usually the same for each layer's neurons.

**Note that** these steps are repeatedly applied for all the neurons and layers. Thus, we can represent them in matrix form:

$$\textbf{Z}^{(l)}=\textbf{A}^{(l-1)}\times \textbf{W}^{(l)}+\textbf{b}^{(l)}$$

where $\times$ denotes <a href="https://en.wikipedia.org/wiki/Matrix_multiplication" target="blank">matrix multiplication</a> and the bias vector $\textbf{b}^{(l)}$ is added to each row of the resulting matrix. From this matrix $\textbf{Z}^{(l)}$, the layer $l$'s output matrix:
$$\textbf{A}^{(l)}=\sigma\left(\textbf{Z}^{(l)}\right)$$
is just the element-wise application of function $\sigma$ to $\mathbf{Z}^{(l)}$, $A^{(l)}_{jk}=\sigma\left(Z^{(l)}_{jk}\right)$.
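The matrix form can be illustrated with a few lines of NumPy. This is a minimal sketch with arbitrary sizes and random values; `np.tanh` is used only as a stand-in for $\sigma$, and all variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)      # fixed seed, chosen arbitrarily
n_cases, n_prev, n_curr = 4, 2, 3   # illustrative sizes

A_prev = rng.random((n_cases, n_prev))      # outputs of layer l-1, one row per case
W = rng.random((n_prev, n_curr)) * 2 - 1    # weights of layer l
b = rng.random((1, n_curr)) * 2 - 1         # bias row vector of layer l

Z = A_prev @ W + b      # b is broadcast, i.e. added to every row
A = np.tanh(Z)          # element-wise activation (tanh as an example)

# The same linear combination computed neuron by neuron (case 0, neuron 1)
z_01 = sum(A_prev[0, k] * W[k, 1] for k in range(n_prev)) + b[0, 1]
```

The entry `Z[0, 1]` matches the explicit sum `z_01`, which is the per-neuron formula applied to a single case.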

The sequence of vectors and matrices required in the whole computation can be graphically represented as:
<img src="images/ann_matrix.png" alt="ann matrix" style="width:1200px;"/>


Remember that we have left a basic component undefined: the activation function $\sigma$. This is the non-linear transformation that is applied to each linear combination in the neurons.

## Activation functions

Let us define the most typical activation functions, starting with the hyperbolic tangent `tanh`.

It is defined as:

$$\phi (z)= \text{tanh}(z) =  \frac{e^{z}-e^{-z}}{e^{-z}+e^{z}}$$

and its first derivative is:
$$\phi' (z)= \text{tanh}'(z) = 1 - \phi^{2}(z) = \frac{4}{(e^{-z}+e^{z})^{2}}$$

We will save them in the same tuple, where the first element is the function itself, and the second element is its first derivative:

In [None]:
tanh = ( lambda z : ((np.e**z)-(np.e**(-z)))/((np.e**(-z))+(np.e**z)),  # function
         lambda z : 4./(((np.e**z)+(np.e**(-z)))**2) )                  # first derivative
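As a hedged aside, the two expressions given for the derivative can be checked numerically. The sketch below uses `np.tanh` instead of the explicit exponentials (an alternative formulation, not the notebook's definition) because it does not overflow for large $|z|$; the name `tanh_stable` is introduced here only for illustration.

```python
import numpy as np

# Alternative formulation (an assumption, not the notebook's definition):
# np.tanh avoids the overflow that np.e**z can hit for large |z|
tanh_stable = (lambda z: np.tanh(z),               # function
               lambda z: 1.0 - np.tanh(z) ** 2)    # first derivative

z = np.linspace(-3, 3, 13)
eps = 1e-6
# a central finite difference approximates the derivative
numeric = (tanh_stable[0](z + eps) - tanh_stable[0](z - eps)) / (2 * eps)
# closed form from the text: 4 / (e^{-z} + e^{z})^2
closed_form = 4.0 / (np.exp(z) + np.exp(-z)) ** 2

max_err_numeric = np.max(np.abs(numeric - tanh_stable[1](z)))
max_err_closed = np.max(np.abs(closed_form - tanh_stable[1](z)))
```

Both errors should be close to zero, which supports the identity $1-\phi^2(z)=4/(e^{-z}+e^{z})^2$.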

The `sigmoid` function is also a standard one. It is defined as:

$$\phi (z)= \text{sigmoid}(z) = \frac{1}{1+e^{-z}}$$

and its first derivative is:

$$\phi' (z)= \text{sigmoid}'(z) = \phi (z) (1 - \phi(z)) = \frac{e^{-z}}{(1+e^{-z})^{2}}$$

In [None]:
sigmoid = ( lambda z : 1./(1.+(np.e**(-z))),                  # function
            lambda z : sigmoid[0](z)*(1.-sigmoid[0](z)) )  # first derivative

A popular modern activation function is the Rectified Linear Unit (`ReLu`), defined as:

$$\phi (z)= \text{ReLu}(z) = \max{(0,z)}$$

and its first derivative is: 

$$\phi' (z)= \text{ReLu}'(z) = \begin{cases} 0, & \text{if } z\leq 0 \\ 1, & \text{otherwise} \end{cases}$$

Note that the derivative is actually undefined at $z=0$. However, by convention, we assume $\phi'(0)=0$.

In [None]:
ReLu = ( lambda z : np.maximum(0,z),  # function
         lambda z : z>0 )             # first derivative (assuming bool->int casting)

Let's have a look at them:

In [None]:
z = np.linspace(-5,5, 100)

fig, axs = plt.subplots(1, 3, figsize=(15,5), facecolor='w', edgecolor='k')
axs[0].plot(z, tanh[0](z),'blue', label="tanh")
axs[0].plot(z, tanh[1](z),'orange', label="tanh'")
axs[0].title.set_text('tanh')
axs[0].legend()

axs[1].plot(z, sigmoid[0](z),'blue', label="sigmoid")
axs[1].plot(z, sigmoid[1](z),'orange', label="sigmoid'")
axs[1].title.set_text('sigmoid')
axs[1].legend()

axs[2].plot(z, ReLu[0](z),'blue', label="ReLu")
axs[2].plot(z, ReLu[1](z),'orange', label="ReLu'")
axs[2].title.set_text('ReLu')
axs[2].legend()

plt.show()

Note that the derivative of the `sigmoid` function never exceeds $0.25$ (its value at $z=0$). Since backpropagation multiplies these derivatives layer after layer, gradients tend to shrink as they flow backward (the *vanishing gradients* problem).
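A small numerical sketch can make this bound concrete. It only considers the activation derivatives and ignores the weights, so it is an illustration of the upper bound rather than a simulation of a real network; `sigmoid_prime` is a name introduced for this example.

```python
import numpy as np

def sigmoid_prime(z):
    """First derivative of the sigmoid, written as s * (1 - s)."""
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

z = np.linspace(-5, 5, 1001)         # grid that contains z = 0
max_slope = sigmoid_prime(z).max()   # largest derivative value on the grid

# Upper bound on the product of sigmoid derivatives across k layers
for k in (1, 5, 10):
    print(k, max_slope ** k)
```

After ten sigmoid layers the factor contributed by the activations alone is already below one millionth.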

Finally, for multi-class classification tasks, usually the activation function of the last layer is `softmax`:

$$\phi(z_i)= \text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^ke^{z_j}}$$

and its first derivative is: 

$$\phi' (z)= \text{softmax}'(z) =\frac{\partial \phi}{\partial z_j}(z_i)= 
\begin{cases} 
\phi(z_i)(1-\phi(z_i)), & \text{if } i=j \\ 
-\phi(z_i)\phi(z_j), & \text{otherwise}
\end{cases}$$

In [None]:
softmax = ( lambda z : np.exp(z)/np.sum(np.exp(z), axis=-1, keepdims=True),            # function (applied row-wise)
            lambda z : np.diag(softmax[0](z))-np.outer(softmax[0](z), softmax[0](z)) )  # Jacobian for a single 1-D vector z
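In practice, exponentials of large values overflow. A common remedy, sketched below under the assumption that inputs arrive as one row per case, is to subtract the row maximum before exponentiating, which leaves the result unchanged. The names `softmax_rows` and `softmax_jacobian` are hypothetical helpers for this example, and the Jacobian is checked against finite differences.

```python
import numpy as np

def softmax_rows(Z):
    """Row-wise softmax; subtracting the row maximum does not change the
    result but avoids overflow in np.exp."""
    shifted = Z - np.max(Z, axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=-1, keepdims=True)

def softmax_jacobian(z):
    """Jacobian for a single 1-D vector z: J[i, j] = d softmax(z)_i / d z_j."""
    s = softmax_rows(z)
    return np.diag(s) - np.outer(s, s)

Z = np.array([[1.0, 2.0, 3.0],
              [1000.0, 1000.0, 1000.0]])   # second row would overflow a naive np.e**z
P = softmax_rows(Z)

# finite-difference check of the Jacobian on the first row
z0, eps = Z[0], 1e-6
J_num = np.column_stack([
    (softmax_rows(z0 + eps * np.eye(3)[j]) - softmax_rows(z0 - eps * np.eye(3)[j])) / (2 * eps)
    for j in range(3)
])
```

The diagonal of the Jacobian corresponds to the case $i=j$ of the derivative above, and the off-diagonal entries to the case $i\neq j$.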

## Building an ANN
We now have all the components for building a neural network with two hidden layers with 3 neurons each and `ReLu` activation, and a single-neuron output layer with `sigmoid` activation:

In [None]:
my_ann = ANN(X.shape[1], [3,3,1], [ReLu, ReLu, sigmoid])

and we can even run it and get some predictions for our data:

In [None]:
my_ann.forward_propagation(X, True)

But, **wait**! We are using random weights and bias terms in all the neurons. These predictions are just random guesses.

We want to set these parameters. We call this the **learning** step, and for ANNs we use backpropagation.

# Learning ANN's parameters

We want the model's parameters that lead to the best possible solution. So, the first thing we need to decide is how to measure which solution is best. We usually do that through a **loss function**: the best model is the one that minimizes the loss function. Note that this is also the objective function of an optimization problem.

Once we have defined this, **backpropagation** is a method that allows us to update (improve) the model's parameters according to that objective (loss function). That is, it enables the application of a single *gradient descent* step.

We start with randomly selected parameters. Then, until the model is good enough, backpropagation is repeatedly executed to calculate the gradients that allow us to compute the parameters' updates.

## Loss function
The loss function gives a score to each parametrized model. We can compare different models (different sets of parameters for our ANN) by comparing their losses.

There are many loss functions and they are different depending on the task (regression or classification). Let's define the most typical ones. 

The **Mean Squared Error** (MSE) is a **regression** loss function defined as:

$$\mathcal{L}_{MSE}(\hat{\mathbf{y}},\mathbf{y}) = \frac{1}{n}\sum_{i = 1}^{n}\left ( \hat{y}_{i} - y_{i} \right )^{2}/2$$

where we include a constant division by 2 (which does not change the minimizer) that simplifies the gradient. Its derivative with respect to each prediction $\hat{y}_i$, leaving the averaging factor $1/n$ to be applied later in backpropagation, is:

$$\mathcal{L}_{MSE}'(\hat{\mathbf{y}},\mathbf{y}) = \frac{\partial \mathcal{L}_{MSE}}{\partial \hat{y}_{i}}(\hat{\mathbf{y}},\mathbf{y})=  \hat{y}_{i} - y_{i}$$

In [None]:
mse_loss = ( lambda yr, yh: np.mean(((yh-yr)**2)/2.), 
             lambda yr, yh: (yh-yr) )

The **cross-entropy** (CE) is a **classification** loss function defined as:

$$\mathcal{L}_{CE}(\hat{\mathbf{Y}},\mathbf{Y}) = -\frac{1}{n}\sum_{i = 1}^{n}\sum_{j = 1}^{k}Y_{i,j}\log\hat{Y}_{i,j}$$

assuming that each $Y_{i,\cdot}$ is a $k$-sized vector such that $\sum_{j=1}^kY_{i,j}=1$ and $\forall j, Y_{i,j}\geq 0$. That is a standard output of a probabilistic classifier in multi-class classification. The derivative with respect to any component $\hat{Y}_{i,j}$ is:

$$\mathcal{L}_{CE}'(\hat{\mathbf{Y}},\mathbf{Y}) =\frac{\partial \mathcal{L}_{CE}}{\partial \hat{Y}_{ij}}(\hat{\mathbf{Y}},\mathbf{Y})= -\frac{Y_{i,j}}{\hat{Y}_{i,j}}$$

In a binary classification task our output is usually a single probability value, $y_i$ for instance $i$. Specifically, $y_i$ is the probability of the positive class for instance $i$, whereas the probability of the negative class would be $1-y_i$. With this in mind, the **binary cross-entropy** (BCE) has a simpler definition:
$$\mathcal{L}_{BCE}(\hat{\mathbf{y}},\mathbf{y}) = - \frac{1}{n}\sum_{i = 1}^{n}\left ( y_{i}\log\hat{y}_{i}+(1-y_{i})\log(1-\hat{y}_{i}) \right )$$
and its derivative is:

$$\mathcal{L}_{BCE}'(\hat{\mathbf{y}},\mathbf{y}) = \frac{\partial \mathcal{L}_{BCE}}{\partial \hat{y}_{i}}(\hat{\mathbf{y}},\mathbf{y})= - \frac{y_{i}}{\hat{y}_{i}}+\frac{1-y_{i}}{1-\hat{y}_{i}}$$



In [None]:
ce_loss = ( lambda yr, yh: - np.mean(np.sum(yr*np.log(yh),axis=-1)),  # sum over classes, mean over cases
            lambda yr, yh: - yr/yh )                                  # element-wise derivative

bce_loss = ( lambda yr, yh: - np.mean(yr*np.log(yh)+(1-yr)*np.log(1-yh)), 
             lambda yr, yh: -yr/yh + (1-yr)/(1-yh))
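When the output neuron uses a sigmoid, multiplying the BCE derivative by the sigmoid derivative simplifies to $\hat{y}_i-y_i$. The sketch below checks this numerically; the clipping inside `bce_grad` is a safeguard added here (an assumption, not part of the definitions above) to avoid division by zero when a prediction saturates at 0 or 1.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_grad(y, y_hat, eps=1e-12):
    """d BCE / d y_hat per sample; predictions are clipped (added safeguard)
    to avoid division by zero."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -y / y_hat + (1.0 - y) / (1.0 - y_hat)

z = np.array([[-2.0], [0.5], [3.0]])   # pre-activations of the output neuron
y = np.array([[0.0], [1.0], [1.0]])    # true labels
y_hat = sigmoid(z)

# chain rule: dL/dz = dL/dy_hat * dy_hat/dz
dL_dz = bce_grad(y, y_hat) * y_hat * (1.0 - y_hat)
```

Here `dL_dz` coincides with `y_hat - y`, which is one reason why the sigmoid and BCE pair is so common in binary classification.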

## Backpropagation

Now that we have defined the concept of ANN and the loss function, we can define the **backpropagation** algorithm. It allows us to perform a single update step for the ANN's parameters, as an iteration of gradient descent (GD) does.
The algorithm applies the *chain rule*: given a composition of functions $f(g(h(x)))$, the derivative $df/dx$ is given by $$\frac{df}{dx}=\frac{df}{dg}\cdot \frac{dg}{dh} \cdot \frac{dh}{dx}$$
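The chain rule can be verified on a toy composition. In this sketch the three functions are arbitrary choices made for illustration, and the analytic product of local derivatives is compared with a finite-difference estimate.

```python
import numpy as np

h = lambda x: 3.0 * x + 1.0      # dh/dx = 3
g = lambda u: u ** 2             # dg/du = 2u
f = lambda v: np.sin(v)          # df/dv = cos(v)

x = 0.3
u = h(x)    # intermediate result of the innermost function
v = g(u)    # intermediate result of the middle function

# product of the local derivatives, each evaluated at its own input
analytic = np.cos(v) * (2.0 * u) * 3.0

# central finite difference on the full composition
eps = 1e-6
numeric = (f(g(h(x + eps))) - f(g(h(x - eps)))) / (2 * eps)
```

Note that each local derivative is evaluated at an intermediate value (`u`, `v`), which is why backpropagation needs the results stored during forward propagation.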

Thus, for a weight $w^{(l)}_{k1}$ of the output layer's linear combination, the derivative of the objective function (loss) with respect to it is:
$$\frac{\partial\mathcal{L}}{\partial w^{(l)}_{k1}}=\frac{\partial\mathcal{L}}{\partial a^{(l)}_{1}} \cdot \frac{\partial a^{(l)}_{1}}{\partial z^{(l)}_{1}} \cdot \frac{\partial z^{(l)}_{1}}{\partial w^{(l)}_{k1}}$$

where $z^{(l)}_1$ is the output of the linear combination performed in the output layer (assuming single-neuron layer) and $a^{(l)}_1$ is the output of the corresponding non-linear activation function (note that this will be our prediction, $a^{(l)}_1=\hat{y}$). 

For parameters from other layers, we need to keep applying the chain rule backward (back-propagating the gradient). For these calculations, we need the partial derivatives of each function involved, as well as the partial results of the previous forward propagation step that allowed us to compute the loss. And, of course, we need the parameters that we would like to update.

In **matrix form**, as this will be performed for multiple weights, neurons and, possibly, for multiple input samples too, it can be described as:
$$\nabla\mathbf{W}^{(l)} =\frac{\partial\mathcal{L}}{\partial \mathbf{A}^{(l)}} \cdot \frac{\partial \mathbf{A}^{(l)}}{\partial \mathbf{Z}^{(l)}} \cdot \frac{\partial \mathbf{Z}^{(l)}}{\partial \mathbf{W}^{(l)}}$$

which can be shown to be:

$$\nabla\mathbf{W}^{(l)}=\frac{1}{n}(\mathbf{A}^{(l)}-\mathbf{y}) \cdot 
\left\{\!\begin{aligned}
&1 &\text{ if } \mathbf{Z}^{(l)}>0\\
&0 &\text{ if } \mathbf{Z}^{(l)}\leq 0
\end{aligned}\right\}
\cdot \mathbf{A}^{(l-1)}$$

if *mean squared error* is used as the loss function and `ReLu` is used as the activation function. For the bias terms $\mathbf{b}$, it is just:
$$\nabla\mathbf{b}^{(l)}=\frac{1}{n}(\mathbf{A}^{(l)}-\mathbf{y}) \cdot 
\left\{\!\begin{aligned}
&1 &\text{ if } \mathbf{Z}^{(l)}>0\\
&0 &\text{ if } \mathbf{Z}^{(l)}\leq 0
\end{aligned}\right\}
\cdot \mathbf{1}$$


For the previous layer, it can be described as:
$$\nabla\mathbf{W}^{(l-1)} =\frac{\partial\mathcal{L}}{\partial \mathbf{A}^{(l)}} \cdot 
\frac{\partial \mathbf{A}^{(l)}}{\partial \mathbf{Z}^{(l)}} \cdot 
\frac{\partial \mathbf{Z}^{(l)}}{\partial \mathbf{A}^{(l-1)}} \cdot 
\frac{\partial \mathbf{A}^{(l-1)}}{\partial \mathbf{Z}^{(l-1)}} \cdot 
\frac{\partial \mathbf{Z}^{(l-1)}}{\partial \mathbf{W}^{(l-1)}}$$


which can be shown to be:

$$\nabla\mathbf{W}^{(l-1)}=\frac{1}{n}(\mathbf{A}^{(l)}-\mathbf{y}) 
\cdot \left\{\!\begin{aligned}
&1 &\text{ if } \mathbf{Z}^{(l)}>0\\
&0 &\text{ if } \mathbf{Z}^{(l)}\leq 0
\end{aligned}\right\}
\cdot \mathbf{W}^{(l)} 
\cdot \left\{\!\begin{aligned}
&1 &\text{ if } \mathbf{Z}^{(l-1)}>0\\
&0 &\text{ if } \mathbf{Z}^{(l-1)}\leq 0
\end{aligned}\right\}
\cdot \mathbf{A}^{(l-2)}$$

**Note that** most of the computation is repeated for updating the different parameters. Thus, organizing the calculations wisely to avoid repetition can save you a lot of computation.
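To make the reuse concrete, here is a minimal sketch for a two-layer network with `ReLu` activations and mean squared error, matching the case discussed above. The sizes, random data and variable names (`delta1`, `delta2`, and so on) are assumptions for illustration; the product with $\mathbf{A}^{(l-1)}$ is written as a transposed matrix product so that the shapes agree, and one gradient entry is checked against a finite difference.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6                                   # number of cases
X = rng.normal(size=(n, 2))
y = rng.normal(size=(n, 1))
W1, b1 = rng.normal(size=(2, 3)), rng.normal(size=(1, 3))
W2, b2 = rng.normal(size=(3, 1)), rng.normal(size=(1, 1))

relu = lambda z: np.maximum(0, z)

def loss(W1, b1, W2, b2):
    A1 = relu(X @ W1 + b1)
    A2 = relu(A1 @ W2 + b2)
    return np.mean((A2 - y) ** 2 / 2.0)

# forward pass, keeping the intermediate results
Z1 = X @ W1 + b1
A1 = relu(Z1)
Z2 = A1 @ W2 + b2
A2 = relu(Z2)

# backward pass: delta2 is computed once and reused for the previous layer
delta2 = (A2 - y) * (Z2 > 0)            # dL/dA2 * dA2/dZ2
grad_W2 = A1.T @ delta2 / n
grad_b2 = delta2.mean(axis=0, keepdims=True)
delta1 = (delta2 @ W2.T) * (Z1 > 0)     # propagate through W2 and the ReLu of layer 1
grad_W1 = X.T @ delta1 / n
grad_b1 = delta1.mean(axis=0, keepdims=True)

# finite-difference check on one entry of W1
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 1] += eps
W1_minus[0, 1] -= eps
numeric = (loss(W1_plus, b1, W2, b2) - loss(W1_minus, b1, W2, b2)) / (2 * eps)
```

The term `delta2` appears in the gradients of both layers, so computing it once and propagating it backward avoids repeating the work.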

In the following implementation, each iteration of the loop handles one layer: `grad` holds the gradient propagated backward by the chain rule up to that layer, we use it to compute the update of the layer's parameters, and then we keep propagating it backward.

In [None]:
'''
Performs backpropagation
y: real output with dim=(n,1): where n is the number of cases
loss_func: loss function and its derivative to be used
lr: learning rate for gradient descent steps
'''
def back_propagation(self, y, loss_func, lr=0.01):
    rl = len(self.intermediate_results) - 1
    if rl < 0: 
        return None
    # we use two indices to track the layer because we store self.layers in 0..L-1 (L:num.layers), while we store 
    # their outputs self.intermediate_results in 1..L, with self.intermediate_results[0][1]=X (input data)
    l = rl-1
    # chain rule's first step: [d Loss / d A^l]
    grad = loss_func[1](y,self.intermediate_results[rl][1]) 
    while (l >= 0):
        # chain rule: product by [d A^l / d Z^l]
        grad = grad * self.layers[l].act_f[1](self.intermediate_results[rl][0])

        # To update weights W: product by [d Z^l / d W^l]
        upd_W = (self.intermediate_results[rl-1][1].T @ grad)/grad.shape[0]
        # To update bias terms b: product by [d Z^l / d b^l]
        upd_b = np.mean(grad,axis=0,keepdims=True) # equivalent to np.ones((1,n))@grad/n, with n the number of cases
        
        # chain rule: product by [d Z^l / d A^(l-1)]
        grad = grad @ self.layers[l].W.T

        # gradient descent step
        self.layers[l].W = self.layers[l].W - lr * upd_W
        self.layers[l].b = self.layers[l].b - lr * upd_b

        l-=1
        rl-=1

# add method to the ANN class
ANN.back_propagation = back_propagation

The sequence of vectors and matrices required in the **first iteration** can be graphically represented as:
<img src="images/backpropagation_it1.png" style="height:400px"/>

where the first step is to compute the gradient of the loss. The column-vector $\mathbf{1}$ represents a size-$n$ vector of $1$'s. The **next iteration** is a fully hidden layer and the computation can be represented as:

<img src="images/backpropagation_it2.png" style="height:400px"/>

The **final iteration** of this 3-layer ANN helps update the parameters of the first layer, the ones applied to the input data:

<img src="images/backpropagation_it3.png" style="height:400px"/>


Now, we can run backpropagation:

In [None]:
my_ann.back_propagation(y,bce_loss,0.01)

# Machine learning: training our artificial neural network

We can build a method that groups all the steps required to perform the training:

- Run **forward propagation** to get the intermediate results and the model's prediction for the training data.
- Run **backpropagation** to update the model's parameters.
- Inspect the performance of the learning process by:
  - measuring the loss and accuracy in training and validation data
  - observing the problem's feature space to visualize decision thresholds

All this is repeated multiple times. Each iteration is called an **epoch** in deep learning terminology. 
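The accuracy measurement mentioned above can be sketched in vectorized form. The helper `binary_accuracy` and its `threshold` parameter are hypothetical names for this example; it assumes that labels and predicted probabilities are both column vectors of shape `(n, 1)`.

```python
import numpy as np

def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of cases whose thresholded prediction matches the label.
    Both arrays are expected to have shape (n, 1)."""
    y_pred = (y_prob > threshold).astype(int)
    return float(np.mean(y_pred == y_true))

y_true = np.array([[0], [1], [1], [0]])
y_prob = np.array([[0.2], [0.7], [0.4], [0.1]])
acc = binary_accuracy(y_true, y_prob)   # the third case is misclassified
```

Keeping both arrays with the same shape matters: comparing an `(n,)` array with an `(n, 1)` array would broadcast to an `(n, n)` matrix and give a misleading result.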

In [None]:
from IPython.display import clear_output

def train(nn, X_train, y_train, X_test, y_test, loss_func, lr=0.01, epochs=1): 
    
    # Data (50x50 grid) for visualizing currently learned concept
    resolution=50
    X0, X1 = np.meshgrid(np.linspace(-1.5,1.5,resolution), 
                         np.linspace(-1.5,1.5,resolution))
    X_aux = np.vstack((X0.ravel(),X1.ravel())).T

    loss= []
    acc=[]
    loss_test=[]
    acc_test=[]
    # Training loops (run through epochs)
    for i in tqdm(range(epochs)):
        # Run forward to calculate loss and intermediate results
        res = nn.forward_propagation(X_train, True)
        loss.append(loss_func[0](y_train, res))
        y_pred=np.array([1 if x>0.5 else 0 for x in res]) 
        acc.append(np.sum(y_pred==y_train[:,0])/y_pred.shape[0])

        # Run backpropagation to update model's parameters
        nn.back_propagation(y_train, loss_func, lr)
    
        # We calculate the loss in validation data too for later inspection
        res = nn.forward_propagation(X_test, False)
        loss_test.append(loss_func[0](y_test,res))
        y_pred=np.array([1 if x>0.5 else 0  for x in res]) 
        acc_test.append(np.sum(y_pred==y_test[:,0])/y_pred.shape[0])

        if i % 50==0: # from time to time
            # Visualize feature's space to display the currently learned concept
            y_aux = nn.forward_propagation(X_aux, False)
            y_aux = y_aux.reshape((resolution,resolution))
            plt.pcolormesh(X0,X1,y_aux,cmap="coolwarm")
            plt.axis("equal")
            clear_output(wait=True)
            plt.show()

    return loss,loss_test,acc,acc_test

## Run
Now we just run the algorithm and visualize the results.

We need to create the train/test split of our data, decide the topology of the ANN and run the `train` method.

Then, we can show the results with different plots:

In [None]:
from sklearn.model_selection import train_test_split

rnd_generator = np.random.default_rng(17)

# set up our ANN architecture
my_ann = ANN(X.shape[1], [5,1], [tanh, sigmoid], rnd_generator)

# split train/validation data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state = 13, shuffle=True, stratify=y)

# let's train our ANN
loss,loss_test,acc,acc_test= train(my_ann,X_train,y_train,X_test,y_test,bce_loss,lr=0.02,epochs=10000) # training

# Visualize learning curves (train and validation data) for loss and accuracy
fig, axs = plt.subplots(1, 2, figsize=(15,5), facecolor='w', edgecolor='k')
axs[0].plot(range(len(loss)), loss, 'tab:cyan', label="Train loss")
axs[0].plot(range(len(loss_test)), loss_test, 'tab:brown', label="Validation loss")
axs[0].set_xlabel("Epochs")
axs[0].set_ylabel("Binary Cross Entropy")
axs[0].legend()

axs[1].plot(range(len(acc)), acc, 'tab:cyan', label="Train accuracy")
axs[1].plot(range(len(acc_test)), acc_test, 'tab:brown', label="Validation accuracy")
axs[1].set_xlabel("Epochs")
axs[1].set_ylabel("Accuracy")
axs[1].legend()
plt.show()

# Compare training data's predicted and real outputs
y_pred = my_ann.forward_propagation(X,False)

fig, axs = plt.subplots(1, 2, figsize=(10,5), facecolor='w', edgecolor='k')

axs[0].scatter(X[:,0],X[:,1],c=y_pred>0.5)
axs[0].set_title("Predicted outputs")
axs[1].scatter(X[:,0],X[:,1],c=y)
axs[1].set_title("Real outputs")
plt.show()

# Questions

- In the results, in the learning curve plots, we observe a constant reduction of the loss function, but accuracy does not change as smoothly. In fact, accuracy in validation finally degrades again. Why? Can you prevent this from happening?
- Can you build a stochastic gradient descent method?
- We can use different `sklearn` methods to create synthetic data (moons or blobs) with more than 2 possible clusters (classes). Modify what is required to perform multi-class classification (`softmax` as activation in the last layer and a loss function for multi-class).