# __Training of a Neural Network for a toy classification problem.__

#### We import the standard libraries and the three classes of the ToyNN toolbox (ToyPb, nD_data, ToyNN)

__See the companion note or the dedicated notebook for a description of the classes.__

In [None]:
import numpy as np
from numpy import random as nprd
from matplotlib import pyplot as plt
#from matplotlib import cm as cm
from toyneuralnetwork import *

### We start by choosing a problem ###

In [None]:
pb = ToyPb(name = "square", bounds = (-1,1))
pb.show_border()

### Next we pick a set of training data and a set of test data that fit the problem. 

In [None]:
ndata = 1000
DATA = nD_data(n = ndata, pb = pb)

ntest = 500
TEST = nD_data(n = ntest, pb = pb, init_pred='yes')

TEST.show_class()
pb.show_border('k--')
plt.legend(loc=1)
plt.show()

### We choose the number of layers and the number of nodes per layer for the neural network (with the constraint of two input nodes and one output node).

In [None]:
CardNodes = (2, 4, 6, 4, 1)
NN = ToyNN(card = CardNodes, coef_bounds=(-1,1,-1,1), chi="tanh", grid=(-1,1,41))
NN.show()

## ___Full Batch Method___

__We optimize the coefficients of the neural network (stored in__ NN.W __and__ NN.Bias __) in order to improve its predictions. In practice, we minimize__
$$
\dfrac1n\sum_{i=0}^{n-1}\ell\left(  y_i \widehat f(X_i)    \right),
$$
__where__ $n=\text{DATA.n}$, $\,\ell=\text{pb.loss}$, $y_i=\text{DATA.Y[i]}$,  $X_i=\text{DATA.X[i]}$ __and where the function $\widehat f(X)=\text{NN.output}(X)$.__
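
As a minimal sketch of this objective outside the ToyNN toolbox, the empirical risk can be written in a few lines of NumPy. The logistic loss and the linear predictor `f_hat` below are assumptions chosen for illustration only; in the notebook the loss is `pb.loss` and the predictor is `NN.output`.

```python
import numpy as np

def logistic_loss(z):
    """Assumed stand-in for pb.loss: l(z) = log(1 + exp(-z))."""
    return np.logaddexp(0.0, -z)

def empirical_risk(f_hat, X, Y, loss=logistic_loss):
    """Mean of loss(y_i * f_hat(X_i)) over the data set."""
    margins = Y * np.array([f_hat(x) for x in X])
    return float(np.mean(loss(margins)))

# Hypothetical predictor standing in for NN.output.
def f_hat(x):
    return 2.0 * x[0] - x[1]

X = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
Y = np.array([1.0, -1.0, -1.0])
risk = empirical_risk(f_hat, X, Y)
print(risk)
```

A predictor that is identically zero gives a risk of $\log 2$ with this assumed loss, which is a convenient reference value when reading cost curves.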

__We first implement the full gradient method with a fixed step size.__

### _Parameters for the Full Gradient and initialization_

In [None]:
to = 1        # fixed step size of the gradient descent algorithm
Nepoch = 200    # Number of epochs 

plot_period = 20

NN = ToyNN(card = CardNodes, coef_bounds=(-1,1,-1,1), chi="tanh", grid=(-1,1,41))

cost  = NN.total_loss_and_prediction(pb=pb, DATA=TEST)
title = "Epoch: " + str(0) + ", Cost: " + str(cost)
print(title)
NN.show_pred()
TEST.show_class(pred="ok")
pb.show_border('w--')
plt.show()

### _Full gradient Iterations_

In [None]:

### Iterations (no sub_iterations)
for epoch in range(1, Nepoch + 1):
    ## Computation of the descent direction
    N=NN.N
    # 0-initialization of the descent vectors
    NN.init_vector()
    # computation and summation over the data of their contributions to the total descent   
    for j in range(ndata):
        Desc_W, Desc_Bias = NN.descent(X=DATA.X[j], y=DATA.Y[j], pb=pb, tau=to)
        NN.add_to_vector(Desc_W, Desc_Bias)
    NN.mult_vector(1/ndata)       # renormalization of the sum of descent vectors      
    # Update of the parameters
    NN.add_vector_to_coefs()

    # computation of the error (sum of test losses)
    if epoch%plot_period==0:
        cost = NN.total_loss_and_prediction(DATA=TEST, pb = pb)
        title = "Epoch: " + str(epoch) + ", Cost: " + str(cost)
        print(title)
        NN.show_pred()
        TEST.show_class(pred="yes")
        pb.show_border('w--')
        plt.show()
    else:
        cost  = NN.total_loss(DATA=TEST, pb=pb)
        title = "Epoch: " + str(epoch) + ", Cost: " + str(cost)
        print(title)

NN.show()  

## Exercise

__1/ Implement the Stochastic Gradient Method with constant step.__


__2/ Observe and comment on the convergence properties compared with the full batch method.__


__3/ Implement the Stochastic Gradient Method with decreasing step sizes:__
$$\tau^k := \dfrac{\gamma \tau^0}{\gamma + k}.$$


__4/ Do you observe an improvement? Can you find an empirical method for choosing $\tau^0$ and $\gamma$?__

__5/ Try the ring problem__ pb = ToyPb(name = "ring", bounds = (-1,1)). __What is the behavior of the full batch method on this problem?__

## Question 1

### _Parameters for the Stochastic Gradient Method and for the Neural network_

### _Stochastic Gradient Iterations_

In [None]:
to = .01         # fixed step size of the gradient descent algorithm
Nepoch = 200     # Number of epochs 
ndata = 1000
plot_period = 20

CardNodes = (2, 4, 6, 4, 1)
NN = ToyNN(card = CardNodes, coef_bounds=(-1,1,-1,1), chi="tanh", grid=(-1,1,41))

cost_list = []

### Iterations
for epoch in range(1, Nepoch + 1):
    N=NN.N
    # ndata stochastic updates per epoch, one randomly drawn sample per update
    for _ in range(ndata):
        # initialization of the stored descent direction
        NN.init_vector()
        # selection of a random data point to perform the gradient step on
        j = np.random.randint(0, ndata)
        # computation of the descent direction
        NN.descent(X=DATA.X[j], y=DATA.Y[j], pb=pb, tau=to, add_to_vector=True) 
        # update of the coefficients
        NN.add_vector_to_coefs()

    # computation of the error (sum of test losses)
    if epoch%plot_period==0:
        cost = NN.total_loss_and_prediction(DATA=TEST, pb = pb)
        title = "Epoch: " + str(epoch) + ", Cost: " + str(cost)
        print(title)
        NN.show_pred()
        TEST.show_class(pred="yes")
        pb.show_border('w--')
        plt.show()
        cost_list.append(cost)
    else:
        cost  = NN.total_loss(DATA=TEST, pb=pb)
        title = "Epoch: " + str(epoch) + ", Cost: " + str(cost)
        print(title)
        cost_list.append(cost)

NN.show()

In [None]:
plt.plot(cost_list)
plt.title("Cost evolution with the epochs (stochastic gradient with constant step)")

## Question 2

We observe that, for the same number of epochs, the prediction seems to converge to the solution (the square) much faster with the stochastic gradient method than with the full-batch gradient method. Note that each stochastic epoch performs `ndata` parameter updates, whereas a full-batch epoch performs a single one.
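
The difference in the number of parameter updates per epoch can be sketched on a small problem that does not depend on the toolbox. Everything below is an assumption made for illustration: a linear score with a logistic loss replaces the network, the data set is a hypothetical half-plane problem, and the step sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: label +1 when x0 + x1 > 0, else -1.
n = 200
X = rng.uniform(-1.0, 1.0, size=(n, 2))
Y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)

def risk(w):
    """Mean logistic loss of the linear score X @ w."""
    return float(np.mean(np.logaddexp(0.0, -Y * (X @ w))))

def sample_grad(w, x, y):
    """Gradient of log(1 + exp(-y * w.x)) with respect to w."""
    return -y * x / (1.0 + np.exp(y * (w @ x)))

def full_batch(n_epoch, tau):
    w, updates = np.zeros(2), 0
    for _ in range(n_epoch):
        grad = np.mean([sample_grad(w, X[j], Y[j]) for j in range(n)], axis=0)
        w -= tau * grad                             # one update per epoch
        updates += 1
    return w, updates

def stochastic(n_epoch, tau):
    w, updates = np.zeros(2), 0
    for _ in range(n_epoch):
        for _ in range(n):
            j = rng.integers(0, n)
            w -= tau * sample_grad(w, X[j], Y[j])   # one update per sample
            updates += 1
    return w, updates

w_fb, n_fb = full_batch(n_epoch=5, tau=1.0)
w_sg, n_sg = stochastic(n_epoch=5, tau=0.1)
print("full batch:", n_fb, "updates, risk", risk(w_fb))
print("stochastic:", n_sg, "updates, risk", risk(w_sg))
```

Both loops visit the same number of samples, but the stochastic one moves the parameters `n` times per epoch. This is one plausible reason for the faster progress per epoch, not a proof that it explains the behavior of the network used in this notebook.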

## Question 3
### _Stochastic Gradient with decreasing step sizes_

In [None]:
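def stochastic_gradient_decreasing_step(tau0, gamma):
    """Train a fresh ToyNN by stochastic gradient with step gamma*tau0/(gamma + k).

    Uses the global DATA, TEST and pb; prints the test cost at each epoch and plots it at the end.
    """
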
    Nepoch = 200
    ndata = 1000
    plot_period = 20

    CardNodes = (2, 4, 6, 4, 1)
    NN = ToyNN(card=CardNodes, coef_bounds=(-1, 1, -1, 1), chi="tanh", grid=(-1, 1, 41))

    cost_list = []
    
    for epoch in range(1, Nepoch + 1):
        ## Computation of the descent direction

        for k in range(ndata):
            j = np.random.randint(0, ndata)  # Selection of random data
            tau = gamma * tau0 / (gamma + (epoch-1)*ndata + k) # (epoch-1)*ndata + k is the current iteration number
            NN.init_vector()  # 0-initialization of the descent vectors
            Desc_W, Desc_Bias = NN.descent(
                X=DATA.X[j], y=DATA.Y[j], pb=pb, tau=tau)
            NN.add_vector_to_coefs(DW=Desc_W, DBias=Desc_Bias)

        #computation of the error (sum of test losses)
        if epoch % plot_period == 0:
            cost = NN.total_loss_and_prediction(DATA=TEST, pb=pb)
            title = "Epoch: " + str(epoch) + ", Cost: " + str(cost)
            print(title)
            NN.show_pred()
            TEST.show_class(pred="yes")
            pb.show_border('w--')
            plt.show()
            cost_list.append(cost)
        else:
            cost = NN.total_loss(DATA=TEST, pb=pb)
            title = "Epoch: " + str(epoch) + ", Cost: " + str(cost)
            print(title)
            cost_list.append(cost)
            
    NN.show()
    plt.figure()
    plt.plot(cost_list)
    plt.title("Cost evolution with the epochs (stochastic gradient with decreasing step)") 


In [None]:
stochastic_gradient_decreasing_step(0.1,10000)

## Question 4
We observe that the asymptotic error is not necessarily better, but the noise is greatly reduced with the decreasing step method.

For the choice of parameters:
* We take a larger $\tau^0$ than for the constant-step stochastic gradient, since the step decreases over the iterations.
* We progressively increase $\gamma$ until we find a good trade-off between convergence speed and noise reduction.

After applying this empirical method, we obtain satisfactory results with $\tau^0 = 0.1$ and $\gamma = 1000$.
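
To get a feel for these values, the schedule $\tau^k = \gamma\tau^0/(\gamma+k)$ can be evaluated at a few iteration counts. The helper `step_size` is a hypothetical name introduced here; the parameters are the ones quoted above, with 200 epochs of 1000 iterations as in the notebook.

```python
def step_size(k, tau0, gamma):
    """Decreasing step tau^k = gamma * tau0 / (gamma + k)."""
    return gamma * tau0 / (gamma + k)

tau0, gamma = 0.1, 1000
n_epoch, ndata = 200, 1000

# Start, end of the first epoch, end of epoch 10, end of training.
for k in (0, gamma, 10 * ndata, n_epoch * ndata):
    print(k, step_size(k, tau0, gamma))
```

The step is halved after $\gamma$ iterations, so with $\gamma = 1000$ it has dropped to $\tau^0/2$ by the end of the first epoch and is about $5\times10^{-4}$ at the end of the run. A larger $\gamma$ keeps the step close to $\tau^0$ for longer.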

## Question 5

In [None]:
pb = ToyPb(name = "ring", bounds = (-1,1))
pb.show_border()

ndata = 1000
DATA = nD_data(n = ndata, pb = pb)

ntest = 500
TEST = nD_data(n = ntest, pb = pb, init_pred='yes')

TEST.show_class()
pb.show_border('k--')
plt.legend(loc=1)
plt.show()

CardNodes = (2, 4, 6, 4, 1)
NN = ToyNN(card = CardNodes, coef_bounds=(-1,1,-1,1), chi="tanh", grid=(-1,1,41))
NN.show()

to = 1        # fixed step size of the gradient descent algorithm
Nepoch = 200    # Number of epochs 

plot_period = 20

NN = ToyNN(card = CardNodes, coef_bounds=(-1,1,-1,1), chi="tanh", grid=(-1,1,41))

cost  = NN.total_loss_and_prediction(pb=pb, DATA=TEST)
title = "Epoch: " + str(0) + ", Cost: " + str(cost)
print(title)
NN.show_pred()
TEST.show_class(pred="ok")
pb.show_border('w--')
plt.show()



In [None]:
### Iterations (no sub_iterations)
for epoch in range(1, Nepoch + 1):
    ## Computation of the descent direction
    N=NN.N
    # 0-initialization of the descent vectors
    NN.init_vector()
    # computation and summation over the data of their contributions to the total descent   
    for j in range(ndata):
        Desc_W, Desc_Bias = NN.descent(X=DATA.X[j], y=DATA.Y[j], pb=pb, tau=to)
        NN.add_to_vector(Desc_W, Desc_Bias)
    NN.mult_vector(1/ndata)       # renormalization of the sum of descent vectors      
    # Update of the parameters
    NN.add_vector_to_coefs()

    # computation of the error (sum of test losses)
    if epoch%plot_period==0:
        cost = NN.total_loss_and_prediction(DATA=TEST, pb = pb)
        title = "Epoch: " + str(epoch) + ", Cost: " + str(cost)
        print(title)
        NN.show_pred()
        TEST.show_class(pred="yes")
        pb.show_border('w--')
        plt.show()
    else:
        cost  = NN.total_loss(DATA=TEST, pb=pb)
        title = "Epoch: " + str(epoch) + ", Cost: " + str(cost)
        print(title)

NN.show()


### Observations
It seems that the full-batch gradient method is unable to isolate the ring: when the gradient is computed over all the points, the gradients of the points inside and outside the ring appear to cancel each other out, which makes it impossible to isolate the ring by looking at the global gradient.
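
The cancellation mechanism can be illustrated in a deliberately simplified setting. This sketch does not use the network of this notebook: it assumes a linear score with a logistic loss evaluated at zero weights, and a hypothetical ring (radii 0.4 and 0.8) built to be exactly symmetric about the origin. It shows only that an averaged gradient can vanish while the per-sample gradients do not.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ring data, symmetric about the origin: every point x comes
# with its mirror image -x, and the label depends only on the radius.
P = rng.uniform(-1.0, 1.0, size=(250, 2))
X = np.vstack([P, -P])
R = np.linalg.norm(X, axis=1)
Y = np.where((R > 0.4) & (R < 0.8), 1.0, -1.0)

# Per-sample gradients of log(1 + exp(-y * w.x)) at w = 0 are -y * x / 2.
grads = -0.5 * Y[:, None] * X

mean_grad_norm = np.linalg.norm(grads.mean(axis=0))
mean_of_norms = np.linalg.norm(grads, axis=1).mean()
print("norm of the averaged gradient:", mean_grad_norm)
print("average per-sample gradient norm:", mean_of_norms)
```

In this setting a full-batch step from zero weights barely moves the parameters, whereas a single-sample step always does. Whether the same effect explains the behavior observed with ToyNN remains a conjecture.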

In [None]:
stochastic_gradient_decreasing_step(0.1,1000)

In contrast, with the stochastic gradient method, the predictions converge quickly to the solution. The problem mentioned above does not arise with "local" gradients.