<a href="https://colab.research.google.com/github/AlbertoMontanelli/Machine-Learning/blob/class_unit/neural_network.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Notation conventions
* **net** = $X \cdot W+b$, $\quad X$: input matrix, $\quad W$: weights matrix, $\quad b$: bias array;
* **number of examples** = $l$ ;
* **number of features** = $n$ ;
* **input_size** : for the layer $i$ -> $k_{i-1}$ : number of the units of the previous layer $i-1$ ;
* **outputz_size** : for the layer $i$ -> $k_{i}$ : number of the units of the current layer $i$;
* **output_value** : $o_i=f(net_i)$ for layer $i$, where $f$ is the activation function.
* **number of labels** = $d$ for each example -> $l \ \textrm{x}\ d$ matrix. \
dim(**labels**) = dim(**predictions**) = dim(**targets**).

### Input Layer $L_0$ with $k_0$ units :
* input_size = $n$;
* output_size = $k_0$;
* net = $X \cdot W +b$, $\quad X$ : $l \ \textrm{x} \ n$ matrix, $\quad W: n \ \textrm{x} \ k_0$ matrix, $\quad b = 1 \ \textrm{x} \ k_0$ array; \
$⇒$ net: $l \ \textrm{x} \ k_0$ matrix.

### Generic Layer $L_i$ with $k_i$ units :
* input_size = $k_{i-1}$ ;
* output_size = $k_i$ ;
* net = $X \cdot W+b$, $\quad X$ : $l \ \textrm{x} \ k_{i-1}$ matrix, $\quad W: k_{i-1} \ \textrm{x} \ k_i$ matrix, $\quad b = 1 \ \textrm{x} \ k_i$ array ; \
$⇒$ net : $l \ \textrm{x} \ k_i$ matrix .

### Online vs mini-batch version:
* online version: $l' = 1$ example;
* mini-batch version: $l' =$ number of examples in the mini-batch.

# Activation functions
Definition of the activation functions and their derivatives.
* **Hidden layers**:
  * **ReLU**: computationally efficient, resilient to vanishing gradient problem, suffers if net < 0;
  * **Leaky ReLU**: better than ReLU in case of convergence problems thanks to non null output for net < 0;
    * $0<$ alpha $<<1$: if alpha is too small, the gradient could be negligable;
  * **ELU**: same as Leaky ReLU, the best in term of performances but the worst in term of computational costs;
  * **tanh**: very useful if data are distributed around 0, but tends to saturate to -1 or +1 in other cases;
* **Output layer**:
  * **Regression problem**: Linear output;
  * **Binary classification**: Sigmoid, converts values to 0 or 1 via threshold while being differentiable in 0, unlike e.g. the sign function;

In [1]:
import numpy as np

# dobbiamo capire che funzioni di attivazione usare e quali derivate
# da iniziare a fare successivamente: cross validation, test vs training error, ricerca di iperparametri (grid search, n layer, n unit,
# learning rule), nr epochs/early stopping, tikhonov regularization, momentum, adaline e altre novelties

def sigmoid(net):
    return 1 / (1 + np.exp(-net))

def d_sigmoid(net):
    return np.exp(-net) / (1 + np.exp(-net))**2

def tanh(net):
    return np.tanh(net)

def d_tanh(net):
    return 1 - (np.tanh(net))**2

"""   DA RIVEDERE

def softmax(net):
    return np.exp(net) / np.sum(np.exp(net), axis = 1, keepdims=True)

def softmax_derivative(net):

    # batch_size is the number of the rows in the matrix net; current_neuron_size is the number of the columns
    batch_size, current_neuron_size = net.shape

    # initialization of Jacobian tensor: each example in the batch (batch_size) is the input to current_neuron_size neurons,
    # for each neuron we compute current_neuron_size derivatives with respect to the other neurons and itself. This results in a
    # batch_size x current_neuron_size x current_neuron_size tensor.
    jacobians = np.zeros((batch_size, current_neuron_size, current_neuron_size))

    for i in range(batch_size): # for each example i in the batch
        s = net[i].reshape(-1, 1)  # creation of a column vector of dimension current_neuron_size x 1, s contains all the features of
                                   # the example i
        jacobians[i] = np.diagflat(s) - np.dot(s, s.T)

    return jacobians
"""

def softplus(net):
    return np.log(1 + np.exp(net))

def d_softplus(net):
    return np.exp(net) / (1 + np.exp(net))

def linear(net):
    return net

def d_linear(net):
    return 1

def ReLU(net):
    return np.maximum(net, 0)

def d_ReLU(net):
    return 1 if(net>=0) else 0

def leaky_relu(net, alpha):
    return np.maximum(net, alpha*net)

def d_leaky_relu(net, alpha):
    return 1 if(net>=0) else alpha

def ELU(net):
    return net if(net>=0) else np.exp(net)-1

def d_ELU(net):
    return 1 if(net>=0) else np.exp(net)


# Loss/Error functions:
Definition of loss/error functions and their derivatives. \
For each derivative we omit a minus from the computation because it's included later in the computation of the learning rule:
* **mean_squared_error**;
* **mean_euclidian_error**;
* **huber_loss**: used when there are expected big and small errors due to outliers or noisy data.

In [2]:
def mean_squared_error(y_true, y_pred):
    return np.sum((y_true - y_pred)**2)

def d_mean_squared_error(y_true, y_pred):
    return 2 * (y_true - y_pred)  # we'd get a minus but it's included in the computation of the learning rule

def mean_euclidian_error(y_true, y_pred):
    return np.sqrt(np.sum((y_true - y_pred)**2))

def d_mean_euclidian_error(y_true, y_pred):
    return (y_true - y_pred) / np.sqrt(np.sum((y_true - y_pred)**2))  # we'd get a minus but it's included in the computation of the learning rule

def huber_loss(y_true, y_pred, delta):
    return 0.5 * (y_true - y_pred)**2 if(np.abs(y_true-y_pred)<=delta) else delta * np.abs(y_true - y_pred) - 0.5 * delta**2

def d_huber_loss(y_true, y_pred, delta):
    return y_true - y_pred if(np.abs(y_true-y_pred)<=delta) else delta * np.sign(y_true-y_pred)


#  class Layer
**Constructor parameters :**
 * input_size : $k_{i-1}$ ;
 * output_size : $k_i$ ;
 * activation_function ;
 * activation_derivative . \\

**Constructor attributes :**
* self.weights : $k_{i-1} \ \textrm{x} \ k_i$ matrix . \\
Initialized extracting randomly from a uniform distribution [-1/a, 1/a], where a = $\sqrt{k_{i-1}}$ ;
* self.biases : $1 \ \textrm{x} \ k_i$ array. Initialized to zeros;
* self.activation_function;
* self.activation_derivative .

**Methods :**
* forward_layer : allows to compute the output of the layer for a given input.
 * parameter :
   * input_array : matrix $X$ (see above for the case $L_0$ or $L_i$) .
 * attributes :
   * self.input : input_array ;
   * self.net : net matrix $X \cdot W + b$ (see above for the case $L_0$ or $L_i$) .
 * return -> output = $f(net)$, where $f$ is the activation function; $f(net)$ has the same dimensions of $net$.
* backward_layer : computes the gradient loss and updates the weights by the learning rule for the single layer.
 * parameters :
   * d_Ep : target_value $-$ output_value, element by element: $l \ \textrm{x} \ d$ matrix.
   * learning_rate.
 * return -> sum_delta_weights $= \delta \cdot W^T$

In [3]:
class Layer:


    def __init__(self, input_size, output_size, activation_function, activation_derivative):
        self.weights = np.random.uniform(low=-1/np.sqrt(input_size), high=1/np.sqrt(input_size), size=(input_size, output_size))
        self.biases = np.zeros((1, output_size))
        self.activation_function = activation_function
        self.activation_derivative = activation_derivative


    def forward_layer(self, input_array):
        self.input = input_array
        self.net = np.dot(self.input, self.weights) + self.biases
        output = self.activation_function(self.net)
        return output


    def backward_layer(self, d_Ep, learning_rate):
        delta = d_Ep * self.activation_derivative(self.net) # loss gradient
        self.weights += learning_rate * np.dot(self.input.T, delta) # learning rule for the weights
        self.biases += learning_rate * np.sum(delta, axis = 0, keepdims = True) # learning rule for the biases
        sum_delta_weights = np.dot(delta, self.weights.T) # loss gradient for hidden layer
        return sum_delta_weights


# class NeuralNetwork
**Constructor attributes**:
 * self.layers: an empty list that will contain the layers.

**Methods**:
 * data_split: splits the input data into training set, validation set and test set
  * parameter:
    * x_tot: total data given as input;
    * K: number K of K-folds used in K-folds cross validation;
    * step_cycle: number of the step in the permutation cycle of the K-fold validation process.
  * attributes:
    * self.x_train: training set;
    * self.x_val: validation set;
    * self.x_test: test set.
 * add_layer: appends a layer to the empty list self.layers
  * parameter:
    * layer: the layer appended to the list self.layers.

* forward: iterates the layer.forward_layer method through each layer in the list self.layers
 * parameter:
   * input: $X$ matrix for layer $L_0$, $o_{i-1}$ for layer $L_i$.
 * return -> input = $o_i$ for layer $L_i$.
* backward: iterates from the last layer to the first layer the layer.backward_layer method, thus updating the weights and the biases for each layer.
 * parameter:
   * d_Ep;
   * learning_rate.

* train_online: applies the forward and backward method to the network for a specified number of epochs **one example at a time**.
 * parameter:
   * x_train: input matrix $X$;
   * target: $l \ \textrm{x} \ d$ matrix;
   * epochs: number of the iterations of the training algorithm;
   * learning_rate;
   * loss_function;
   * loss_function_derivative.

* train_minibatch: applies the forward and backward method to the network for a specified number of epochs **to batches of $l' < l$** examples.
 * parameter:
    * x_train: input matrix $X$;
    * target: $l \ \textrm{x} \ d$ matrix;
    * epochs: number of the iterations of the training algorithm;
    * learning_rate;
    * loss_function;
    * loss_function_derivative;
    * batch_size.


In [14]:
class NeuralNetwork:

    def __init__(self):
        self.layers = []


    def data_split(self, x_tot, k, step_cycle):
        '''# randomization of the input matrix
        indices = np.arange(num_samples) # creates an array from 0 to num_samples - 1
        np.random.shuffle(indices) # shuffling the indices
        x_tot = x_tot[indices] # re-ordering of the rows according to the new indices'''

        # splitting of the data batch into x_train_val and x_test
        num_samples = x_tot.shape[0]
        x_train_val = x_tot[:int(0.8 * num_samples)] # training set and validation set make up 80% of the original data set
        self.x_test = x_tot[int(0.8 * num_samples):] # test set makes up 20% of the original data set

        # splitting of the x_train_val batch into x_train and x_val
        if k == 1: # hold-out cross validation
            self.x_train = x_train_val[:int(0.75 * x_train_val.shape[0])] # training set makes up 60% of the original data set
            self.x_val = x_train_val[int(0.75 * x_train_val.shape[0]):] # validation set makes up 20% of the original data
        else: # k-fold cross validation
            fold_size = int(x_train_val.shape[0] / k) # number of rows per fold
            self.x_train = np.concatenate([x_train_val[:step_cycle*fold_size], self.x_train_val[(step_cycle+1)*fold_size:]]) # training set
            self.x_val = x_train_val[step_cycle*fold_size:(step_cycle+1)*fold_size] # validation set


    def add_layer(self, layer):
        self.layers.append(layer)


    def forward(self, input):
        for layer in self.layers:
            input = layer.forward_layer(input)
        return input


    def backward(self, d_Ep, learning_rate):
        for layer in reversed(self.layers):
            d_Ep = layer.backward_layer(d_Ep, learning_rate)


    def train_online(self, x_train, target, epochs, learning_rate, loss_function, loss_function_derivative):
        for epoch in range(epochs):
          epoch_loss = 0

          for x_train_row, target_row in zip(x_train, target):

            x_train_row = x_train_row.reshape(1, -1)
            target_row = target_row.reshape(1, -1)

            # Forward propagation
            predictions = self.forward(x_train_row) # predictions = output of the output layer

            # Compute loss and loss gradient for backward function
            loss = loss_function(target_row, predictions)
            loss_gradient = loss_function_derivative(target_row, predictions)
            epoch_loss += loss  # accumulates the losses for each example

            # Backward propagation
            self.backward(loss_gradient, learning_rate)

          # computation of the average loss per epoch
          average_epoch_loss = epoch_loss / len(x_train)
          print(f"ONLINE: epoch #{epoch}, Average Loss: {average_epoch_loss}")


    def train_minibatch(self, x_train, target, epochs, learning_rate, loss_function, loss_function_derivative, batch_size):
        num_samples = x_train.shape[0] # selection of the number of rows

        for epoch in range(epochs):
          epoch_loss = 0

          # the rows of the input matrix are randomized in order to have different examples in each mini-batch for each epoch
          indices = np.arange(num_samples) # creates an array from 0 to num_samples - 1
          np.random.shuffle(indices) # shuffling the indices
          x_train = x_train[indices] # re-ordering of the rows according to the new indices
          target = target[indices] # same but for the targets

          # process data in batches
          for i in range(0, num_samples, batch_size): # even if the last mini-batch does not have size equal to batch_size it is processed anyway
            x_batch = x_train[i:i+batch_size]
            target_batch = target[i:i+batch_size]


           # Forward propagation
            predictions = self.forward(x_batch) # predictions = output of the output layer

            # Compute loss and loss gradient for backward function
            loss = loss_function(target_batch, predictions)
            loss_gradient = loss_function_derivative(target_batch, predictions)
            epoch_loss += np.sum(loss)  # accumulates the loss for all the examples in the mini-batch for each mini-batch

            # Backward propagation
            self.backward(loss_gradient, learning_rate)

          # computation of the average loss per epoch
          average_epoch_loss = epoch_loss / num_samples
          print(f"MINIBATCH: epoch #{epoch}, Average Loss: {average_epoch_loss}")


#Unit Test

In [15]:
#test
np.random.seed(42)

x = np.random.rand(20, 3)
target = np.random.rand(20, 2)

layer_one = Layer(3, 2, linear, d_linear)
layer_two = Layer(2, 2, linear, d_linear)

NN = NeuralNetwork()
NN.data_split(x, 4, 1)
print(f'validation set {NN.x_val}')
print(f'training set {NN.x_train}')
print(f'training + validation {NN.x_train_val}')

validation set [[0.03438852 0.9093204  0.25877998]
 [0.54671028 0.18485446 0.96958463]
 [0.37454012 0.95071431 0.73199394]
 [0.60754485 0.17052412 0.06505159]]
training set [[0.19598286 0.04522729 0.32533033]
 [0.05808361 0.86617615 0.60111501]
 [0.13949386 0.29214465 0.36636184]
 [0.66252228 0.31171108 0.52006802]
 [0.43194502 0.29122914 0.61185289]
 [0.77513282 0.93949894 0.89482735]
 [0.70807258 0.02058449 0.96990985]
 [0.94888554 0.96563203 0.80839735]
 [0.18340451 0.30424224 0.52475643]
 [0.83244264 0.21233911 0.18182497]
 [0.45606998 0.78517596 0.19967378]
 [0.44015249 0.12203823 0.49517691]]
training + validation [[0.19598286 0.04522729 0.32533033]
 [0.05808361 0.86617615 0.60111501]
 [0.13949386 0.29214465 0.36636184]
 [0.66252228 0.31171108 0.52006802]
 [0.03438852 0.9093204  0.25877998]
 [0.54671028 0.18485446 0.96958463]
 [0.37454012 0.95071431 0.73199394]
 [0.60754485 0.17052412 0.06505159]
 [0.43194502 0.29122914 0.61185289]
 [0.77513282 0.93949894 0.89482735]
 [0.70807258

In [6]:
#test
np.random.seed(42)

x = np.random.rand(1000, 3)
target = np.random.rand(1000, 2)


layer_one1 = Layer(3, 2, linear, d_linear)
layer_one2 = Layer(3, 2, linear, d_linear)
layer_two1 = Layer(2, 2, linear, d_linear)
layer_two2 = Layer(2, 2, linear, d_linear)

NN1 = NeuralNetwork()
NN1.add_layer(layer_one1)
NN1.add_layer(layer_two1)
NN2 = NeuralNetwork()
NN2.add_layer(layer_one2)
NN2.add_layer(layer_two2)
NN1.train_minibatch(x, target, 10, 0.01, mean_squared_error, d_mean_squared_error, 3)
NN2.train_online(x, target, 10, 0.01, mean_squared_error, d_mean_squared_error)

MINIBATCH: epoch #0, Average Loss: 0.1812626781214725
MINIBATCH: epoch #1, Average Loss: 0.1697308739326982
MINIBATCH: epoch #2, Average Loss: 0.16813324213355693
MINIBATCH: epoch #3, Average Loss: 0.167909063796055
MINIBATCH: epoch #4, Average Loss: 0.16769058338936446
MINIBATCH: epoch #5, Average Loss: 0.16748801198416394
MINIBATCH: epoch #6, Average Loss: 0.16766824920313117
MINIBATCH: epoch #7, Average Loss: 0.1675805069710757
MINIBATCH: epoch #8, Average Loss: 0.1689332367348612
MINIBATCH: epoch #9, Average Loss: 0.1685914707561085
ONLINE: epoch #0, Average Loss: 0.18308535767693287
ONLINE: epoch #1, Average Loss: 0.170414573400587
ONLINE: epoch #2, Average Loss: 0.16823397283664485
ONLINE: epoch #3, Average Loss: 0.1674839462665076
ONLINE: epoch #4, Average Loss: 0.16717077679125641
ONLINE: epoch #5, Average Loss: 0.1670230640723701
ONLINE: epoch #6, Average Loss: 0.1669464292220957
ONLINE: epoch #7, Average Loss: 0.1669038083643578
ONLINE: epoch #8, Average Loss: 0.1668794691095