# Backpropagation Neural Networks - Learning and Implementation strategies

In this notebook, we will study the use of `PyTorch` and `Tensorflow` frameworks for implementing and training Neural Networks. This is not intended to be exhaustive, but rather to provide examples for exploring the algorithms and their hyperparameters with these frameworks.


### Imports

In [1]:
import numpy as np

In [2]:
import tensorflow as tf
from tensorflow import keras

In [3]:
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.optimizers import SGD, RMSprop, Adam

In [4]:
import matplotlib.pyplot as plt
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from mlxtend.plotting import plot_decision_regions
from tqdm import tqdm

In [5]:
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

### Global settings

In [6]:
seed = 1
device = 'cpu'

### XOR (exclusive OR) data simulation

Let's run our examples for a very simple non-linear binary classification example.

In [7]:
# Data simulation
N_data = 200
train_size = 150
X = 2 * torch.rand(N_data, 2, device=device, dtype=torch.float32) - 1
y = torch.tensor([0 if elem[0]*elem[1] < 0 else 1 for elem in X], device=device, dtype=torch.float32)

# Split training and test partitions
X_train = X[:train_size, :]
y_train = y[:train_size]
X_val = X[train_size:, :]
y_val = y[train_size:]

# Define datasets for data loaders
train_ds = TensorDataset(X_train, y_train)
val_ds = TensorDataset(X_val, y_val)

# Create the data loaders
batch_size_GD = train_size
train_dl_GD = DataLoader(train_ds, batch_size_GD, shuffle=True)
val_dl_GD = DataLoader(val_ds, batch_size_GD, shuffle=True)

batch_size_SGD = 1
train_dl_SGD = DataLoader(train_ds, batch_size_SGD, shuffle=True)
val_dl_SGD = DataLoader(val_ds, batch_size_SGD, shuffle=True)

batch_size_MiniSGD = 32
train_dl_MniSGD = DataLoader(train_ds, batch_size_MiniSGD, shuffle=True)
val_dl_MiniSGD = DataLoader(val_ds, batch_size_MiniSGD, shuffle=True)

print('Partition sizes')
print([X_train.shape, y_train.shape], [X_train.dtype, y_train.dtype], [X_train.device, y_train.device])
print([X_val.shape, y_val.shape], [X_val.dtype, y_val.dtype], [X_val.device, y_val.device])

print('\nBatch sizes')
print(batch_size_GD)
print(batch_size_SGD)
print(batch_size_MiniSGD)

Partition sizes
[torch.Size([150, 2]), torch.Size([150])] [torch.float32, torch.float32] [device(type='cpu'), device(type='cpu')]
[torch.Size([50, 2]), torch.Size([50])] [torch.float32, torch.float32] [device(type='cpu'), device(type='cpu')]

Batch sizes
150
1
32


In [8]:

X_np = X.cpu().numpy()
y_np = y.cpu().numpy()

fig = go.Figure()

# Class 0
fig.add_trace(go.Scatter(
    x=X_np[y_np == 0, 0],
    y=X_np[y_np == 0, 1],
    mode='markers',
    marker=dict(color='blue', size=9, opacity=0.35),
    name='Class 0'
))

# Class 1
fig.add_trace(go.Scatter(
    x=X_np[y_np == 1, 0],
    y=X_np[y_np == 1, 1],
    mode='markers',
    marker=dict(color='red', size=9, opacity=0.35),
    name='Class 1'
))

fig.add_shape(type='line', x0=0, x1=0, y0=-1.75, y1=1.75, line=dict(color='black', dash='dot'))
fig.add_shape(type='line', x0=-1.75, x1=1.75, y0=0, y1=0, line=dict(color='black', dash='dot'))
fig.update_layout(
    width=550,
    height=400,
    xaxis=dict(range=[-1.25, 1.25], title=r'$x_1$'),
    yaxis=dict(range=[-1.25, 1.25], title=r'$x_2$'),
    legend=dict(x=1, y=1),
    margin=dict(l=20, r=20, t=20, b=20)
)

fig.show()


<br />
<hr />

## 1. Building a Network with PyTorch

Check the <a href='https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module'>PyTorch documentation</a> for details.

To build a Neural Network in PyTorch, we create new Classes that inherit from the `nn.Module` class.

This, some functionalities are already implemented. However, the model definitions have to be made mainly by hand. As a general rule, following the setps:

1. Define the architecture of the network;
2. Initialize the weights and biases of the network;
3. Define the forward pass of the network;
4. Define the training loop behaviour;
5. Create the NN instance;
6. Define the loss function;
7. Define the optimizer.


### 1. 1. Network architecture

Let's create a NN with three **linear layers**:

1. The first will upscale the 2 input layer nodes (2 features in the dataset) to 3 nodes.
2. The second will upscale to 4 nodes.
3. The third (output) will, then, downscale to 1 node.

Besides that, the **activation functions** should be:

1. The ReLU function in the hidden layer, and
2. The sigmoid function in the output layer.

```python
class Net(nn.Module):
    def __init__(self):
        # Instantiate the parent (nn.Module) class
        super(Net, self).__init__()
    
        # NN architecture definition
        ...

```
<br />

### 1.2. Weights and bias initialization

For the weights and bias initialization, let's use methods that are already available in the PyTorch library. However, notice that you can easily implement your own initialization methods.

- Weights: let's use the Xavier Uniform Initialization method. It maintains the variance of the activations remain the same across the layers of the network.
- Bias: bias will be initializes with zeros.

Take a look at the PyTorch docs for more available initialization methods:
<a href='https://pytorch.org/docs/stable/nn.init.html'>https://pytorch.org/docs/stable/nn.init.html</a>.

```python
torch.nn.init.xavier_uniform_(attribute.weight)
torch.nn.init.zeros_(attribute.bias)

```
<br />

### 1. 3. Forward pass

This step sets how the linear combination of the inputs and weights of each layer should work and how the combination of the linear step should be combined with the activation functions:

```python
def forward(self, x):
    # For each layer, the output will be the ReLu activation applied to the output of the linear operation
    x = self.activation(self.fc1(x))
    # For the last layer, the sigmoid function will be the activation
    x = torch.sigmoid(self.fc2(x))
    return x
```
<br />

### 1. 4. Training loop

Having defined the Neural Network topology, the initialization method, and the feedforward pass, the behavior of the backpropagation should be set in the training loop.

For that, Pytorch has some useful methods:

- The `backward()` method calculates the derivative of the Error in respect to the NN weights applying the chain rule for hidden neurons;
- The `step()` method updates the weights and bias based on the computed gradients.

<br />

Let's put steps 1 to 4 altogether:

<br />

In [9]:
class Net(nn.Module):
    
    def __init__(self):
        
        super(Net, self).__init__()
        #
        # 1. 1. Network architecture
        #
        # Layer 1 (input to 3 nodes)
        self.fc1 = nn.Linear(2, 3)
        # Layer 2 (3 to 4 nodes)
        self.fc2 = nn.Linear(3, 4)
        # Layer 2 (4 nodes to 1 output node for binary classification)
        self.fc3 = nn.Linear(4, 1)
        # Hidden layers activation function
        self.activation = nn.ReLU()
        # Weights initialisation
        # The apply method applies the function passed as the apply() argument
        # to each element in the object, that in this case is the neural network.
        self.apply(self._init_weights)
        
    #
    # 1. 2. Weights and bias initialization
    #
    def _init_weights(self, attribute):
        if isinstance(attribute, nn.Linear):
          torch.nn.init.xavier_uniform_(attribute.weight)
          torch.nn.init.zeros_(attribute.bias)
    #
    # 1. 3. Forward pass
    #
    def forward(self, x):
        # For each layer, the output will be the ReLu activation applied to the output of the linear operation
        x = self.activation(self.fc1(x))
        x = self.activation(self.fc2(x))
        # For the last layer, the sigmoid function will be the activation
        x = torch.sigmoid(self.fc3(x))
        return x

    #
    # 1. 4. Training loop
    # For details, see Machine Learning with PyTorch and Scikit-Learn.
    #
    def train(self, num_epochs, loss_fn, optimizer, train_dl, train_size, batch_size, x_valid, y_valid):
        # Initialize weights
        self.apply(self._init_weights)
    
        # Loss and accuracy history objects initialization
        loss_hist_train = [0] * num_epochs
        accuracy_hist_train = [0] * num_epochs
        loss_hist_valid = [0] * num_epochs
        accuracy_hist_valid = [0] * num_epochs
        
        # Learning loop
        for epoch in tqdm(range(num_epochs)):
            # Batch learn
            for x_batch, y_batch in train_dl:
                # ---
                # 1.4.1. Get the predictions, the [:,0] reshapes from (batch_size,1) to (batch_size)
                pred = self(x_batch)[:,0]
                # 1.4.2. Compute the loss
                loss = loss_fn(pred, y_batch)
                # 1.4.3. Back propagate the gradients
                # The `backward()` method, already available in PyTroch, calculates the 
                # derivative of the Error in respect to the NN weights
                # applying the chain rule for hidden neurons.
                loss.backward()
                # 1.4.4. Update the weights based on the computed gradients
                optimizer.step()
                # ---
                
                # Reset to zero the gradients so they will not accumulate over the mini-batches
                optimizer.zero_grad()
                
                # Update performance metrics
                loss_hist_train[epoch] += loss.item()
                is_correct = ((pred>=0.5).float() == y_batch).float()
                accuracy_hist_train[epoch] += is_correct.mean()
                
            # Average the results
            loss_hist_train[epoch] /= train_size/batch_size
            accuracy_hist_train[epoch] /= train_size/batch_size
            
            # Predict the validation set
            pred = self(x_valid)[:, 0]
            loss_hist_valid[epoch] = loss_fn(pred, y_valid).item()
            is_correct = ((pred>=0.5).float() == y_valid).float()
            accuracy_hist_valid[epoch] += is_correct.mean()
            
        return loss_hist_train, loss_hist_valid, accuracy_hist_train, accuracy_hist_valid

    # Not needed normaly, it is just for mlextend plots
    def predict(self, x):
        x = torch.tensor(x, dtype=torch.float32)
        pred = self.forward(x)[:, 0]
        return (pred>=0.5).float()
        

In [None]:
class Net_group_Y(nn.Module):
    
    def __init__(self, input_size=2, output_size=1,
                 hidden_layer_sizes=[3,4,5], activation=nn.ReLU()):
        
        super(Net_group_Y, self).__init__()
        #
        # 1. 1. Network architecture
        
        self.add_module(f'fc{1}', nn.Linear(input_size, hidden_layer_sizes[0]))
        
        for i in range(1,len(hidden_layer_sizes)):
            self.add_module(f'fc{i+1}', nn.Linear(hidden_layer_sizes[i-1], hidden_layer_sizes[i]))
        
        self.add_module(f'fc{len(hidden_layer_sizes) +1 }', nn.Linear(hidden_layer_sizes[-1], output_size))

        # Weights initialisation
        # The apply method applies the function passed as the apply() argument
        # to each element in the object, that in this case is the neural network.
        self.apply(self._init_weights)
        # Store the parameters
        self.input_size = input_size
        self.output_size = output_size
        self.hidden_layer_sizes = hidden_layer_sizes
        self.activation = activation
        
        
    #
    # 1. 2. Weights and bias initialization
    #
    def _init_weights(self, attribute):
        if isinstance(attribute, nn.Linear):
          torch.nn.init.xavier_uniform_(attribute.weight)
          torch.nn.init.zeros_(attribute.bias)
    #
    # 1. 3. Forward pass
    """    
    def forward(self, x):
        # For each layer, the output will be the ReLu activation applied to the output of the linear operation
        x = self.activation(self.fc1(x))
        x = self.activation(self.fc2(x))
        # For the last layer, the sigmoid function will be the activation
        x = torch.sigmoid(self.fc3(x))
        return x"""
    
    def forward(self, x):
        # Forward pass through all layers
        for i in range(1, len(self.hidden_layer_sizes) + 2):
            #print(f'forward pass layer {i}')
            layer = getattr(self, f'fc{i}')
            x = layer(x)
            if i < len(self.hidden_layer_sizes):
                x = self.activation(x)
        # Apply sigmoid activation to the output layer
        x = torch.sigmoid(x)
        return x
    
    #
    # 1. 4. Training loop
    # For details, see Machine Learning with PyTorch and Scikit-Learn.
    #
    def train(self, num_epochs, loss_fn, optimizer, train_dl, train_size, batch_size, x_valid, y_valid):
        # Initialize weights
        self.apply(self._init_weights)
    
        # Loss and accuracy history objects initialization
        loss_hist_train = [0] * num_epochs
        accuracy_hist_train = [0] * num_epochs
        loss_hist_valid = [0] * num_epochs
        accuracy_hist_valid = [0] * num_epochs
        
        # Learning loop
        for epoch in tqdm(range(num_epochs)):
            # Batch learn
            for x_batch, y_batch in train_dl:
                #print('*'*20)
                # ---
                # 1.4.1. Get the predictions, the [:,0] reshapes from (batch_size,1) to (batch_size)
                pred = self(x_batch)[:,0]
                # 1.4.2. Compute the loss
                loss = loss_fn(pred, y_batch)
                # 1.4.3. Back propagate the gradients
                # The `backward()` method, already available in PyTroch, calculates the 
                # derivative of the Error in respect to the NN weights
                # applying the chain rule for hidden neurons.
                loss.backward()
                # 1.4.4. Update the weights based on the computed gradients
                optimizer.step()
                # ---
                
                # Reset to zero the gradients so they will not accumulate over the mini-batches
                optimizer.zero_grad()
                
                # Update performance metrics
                loss_hist_train[epoch] += loss.item()
                is_correct = ((pred>=0.5).float() == y_batch).float()
                accuracy_hist_train[epoch] += is_correct.mean()
                
            # Average the results
            loss_hist_train[epoch] /= train_size/batch_size
            accuracy_hist_train[epoch] /= train_size/batch_size
            
            # Predict the validation set
            pred = self(x_valid)[:, 0]
            loss_hist_valid[epoch] = loss_fn(pred, y_valid).item()
            is_correct = ((pred>=0.5).float() == y_valid).float()
            accuracy_hist_valid[epoch] += is_correct.mean()
            
        return loss_hist_train, loss_hist_valid, accuracy_hist_train, accuracy_hist_valid

    # Not needed normaly, it is just for mlextend plots
    def predict(self, x):
        print(f'predict with input shape: {x.shape}')
        x = torch.tensor(x, dtype=torch.float32)
        pred = self.forward(x)[:, 0]
        print(f'finished predict with output shape: {pred.shape}')
        return (pred>=0.5).float()
        

In [None]:
from typing import List

exal_NEt_structre_hyperparameters = [3,6,8,5]

frist_instance = Net_group_Y(input_size=2, output_size=1, hidden_layer_sizes=exal_NEt_structre_hyperparameters)
frist_instance.to(device)
loss_fn_instance = nn.BCELoss()
learning_rate_parameter = 0.01
optimizer_name = 'Adam'

optimizer_choiche = {
    'GD': torch.optim.SGD(frist_instance.parameters(), lr=learning_rate_parameter),
    'SGD': torch.optim.SGD(frist_instance.parameters(), lr=learning_rate_parameter),
    'MiniSGD': torch.optim.SGD(frist_instance.parameters(), lr=learning_rate_parameter),
    'ASGD': torch.optim.ASGD(frist_instance.parameters(), lr=learning_rate_parameter),
    
    'RMSprop': torch.optim.RMSprop(frist_instance.parameters(), lr=learning_rate_parameter),
    'Adam': torch.optim.Adam(frist_instance.parameters(), lr=learning_rate_parameter)
}

num_epochs = 2



history = frist_instance.train(
    loss_fn=loss_fn_instance, 
    optimizer=optimizer_choiche[optimizer_name], 
    num_epochs=num_epochs, 
    train_dl=train_dl_GD, 
    train_size=train_size, 
    batch_size=batch_size_GD,
    x_valid=X_val, y_valid=y_val
)


100%|██████████| 2/2 [00:00<00:00, 230.61it/s]

********************
forward pass layer 1
forward pass layer 2
forward pass layer 3
forward pass layer 4
forward pass layer 5
forward pass layer 1
forward pass layer 2
forward pass layer 3
forward pass layer 4
forward pass layer 5
********************
forward pass layer 1
forward pass layer 2
forward pass layer 3
forward pass layer 4
forward pass layer 5
forward pass layer 1
forward pass layer 2
forward pass layer 3
forward pass layer 4
forward pass layer 5





In [39]:
def train_model(X_train,X_val,y_train,y_val,
                model, num_epochs, loss_fn, optimizer_name, batch_size, learning_rate):
    """
    Train the model with the given parameters.
    
    Parameters:
    - model: The neural network model to train.
    - num_epochs: Number of epochs to train the model.
    - loss_fn: Loss function to use for training.
    - optimizer: Optimizer to use for training.
    - batch_size: Size of each batch during training.
    - x_valid: Validation input data.
    - y_valid: Validation target data.
    - learning_rate: Learning rate for the optimizer.
    
    Returns:
    - history: Training history containing loss and accuracy metrics.
    """

    # Define datasets for data loaders
    train_ds = TensorDataset(X_train, y_train)
    val_ds = TensorDataset(X_val, y_val)
    train_size = len(train_ds)
    if optimizer_name == 'GD':
        batch_size = train_size
        train_dl = DataLoader(train_ds, batch_size, shuffle=True)
        val_dl = DataLoader(val_ds, batch_size, shuffle=True)
    
    elif optimizer_name == 'SGD':
        batch_size = 1
        train_dl = DataLoader(train_ds, batch_size, shuffle=True)
        val_dl = DataLoader(val_ds, batch_size, shuffle=True)
    else:
        batch_size = batch_size
        train_dl = DataLoader(train_ds, batch_size, shuffle=True)
        val_dl = DataLoader(val_ds, batch_size, shuffle=True)
    
    optimizer_choiche = {
    'GD': torch.optim.SGD(model.parameters(), lr=learning_rate),
    'SGD': torch.optim.SGD(model.parameters(), lr=learning_rate),
    'MiniSGD': torch.optim.SGD(model.parameters(), lr=learning_rate),
    'ASGD': torch.optim.ASGD(model.parameters(), lr=learning_rate),
    
    'RMSprop': torch.optim.RMSprop(model.parameters(), lr=learning_rate),
    'Adam': torch.optim.Adam(model.parameters(), lr=learning_rate)
    }
    optimizer_instance = optimizer_choiche[optimizer_name]
    
    return model.train(
        num_epochs=num_epochs, 
        loss_fn=loss_fn, 
        optimizer=optimizer_instance, 
        train_dl=train_dl, 
        train_size=train_size, 
        batch_size=batch_size,
        x_valid=train_ds.tensors[0],
        y_valid=train_ds.tensors[1])
    
# Example usage
history = train_model(
    X_train=X_train, 
    X_val=X_val, 
    y_train=y_train, 
    y_val=y_val,
    model=frist_instance,
    num_epochs=10,
    loss_fn=loss_fn_instance,
    optimizer_name="Adam",
    batch_size=10,
    learning_rate=learning_rate_parameter
)

100%|██████████| 10/10 [00:00<00:00, 96.15it/s]

********************
forward pass layer 1
forward pass layer 2
forward pass layer 3
forward pass layer 4
forward pass layer 5
********************
forward pass layer 1
forward pass layer 2
forward pass layer 3
forward pass layer 4
forward pass layer 5
********************
forward pass layer 1
forward pass layer 2
forward pass layer 3
forward pass layer 4
forward pass layer 5
********************
forward pass layer 1
forward pass layer 2
forward pass layer 3
forward pass layer 4
forward pass layer 5
********************
forward pass layer 1
forward pass layer 2
forward pass layer 3
forward pass layer 4
forward pass layer 5
********************
forward pass layer 1
forward pass layer 2
forward pass layer 3
forward pass layer 4
forward pass layer 5
********************
forward pass layer 1
forward pass layer 2
forward pass layer 3
forward pass layer 4
forward pass layer 5
********************
forward pass layer 1
forward pass layer 2
forward pass layer 3
forward pass layer 4
forward pass 




In [38]:
history

([0.6930275122324626,
  0.672054918607076,
  0.6303154667218526,
  0.5575008710225423,
  0.45660227139790854,
  0.3482097715139389,
  0.2670164187749227,
  0.2445397838950157,
  0.2023410402238369,
  0.15560577996075153],
 [0.6811821460723877,
  0.6533634662628174,
  0.5925976634025574,
  0.499734491109848,
  0.3825667202472687,
  0.2644290626049042,
  0.22350843250751495,
  0.2099994719028473,
  0.15125267207622528,
  0.1303766816854477],
 [tensor(0.4667),
  tensor(0.6400),
  tensor(0.6933),
  tensor(0.7000),
  tensor(0.7667),
  tensor(0.8867),
  tensor(0.9000),
  tensor(0.9333),
  tensor(0.9067),
  tensor(0.9400)],
 [tensor(0.5733),
  tensor(0.6733),
  tensor(0.7000),
  tensor(0.7333),
  tensor(0.8867),
  tensor(0.8933),
  tensor(0.9200),
  tensor(0.9200),
  tensor(0.9333),
  tensor(0.9467)])

### 1.5. NN instances

In [None]:
# Instantiate the NNs
nn_names = ['GD', 'SGD', 'MiniSGD', 'ASGD', 'RMSprop', 'Adam']
nn_torch = {}
for k in nn_names:
    nn_torch.update({
        k: Net().to(device)
    })


The instantiation method have initialized the NN weights and bias:

In [None]:
nn_torch['GD'].fc1.weight

In [None]:
nn_torch['GD'].fc1.bias

### 1. 6. Loss

For XOR problem, let's use the standard Binary Crossentropy Loss. 
Take a look at the PyTorch docs for more available loss functions:
<a href='https://pytorch.org/docs/stable/nn.html#loss-functions'>https://pytorch.org/docs/stable/nn.html#loss-functions</a>.


In [None]:
loss_fn = nn.BCELoss()

### 1. 7. Optimizer

In this notebook, let's comapre the performance of the following optimizers:

- Gradient Descent (GD)
- Stochastic Gadient Descent (SGD)
- Mini-batch Gadient Descent (MiniSGD)
- Averaged Stochastic Gradient Descent (ASGD)
- Root Mean Square Propagation (RMSprop)

Notice that the GD, SGD, and MiniSGD use the same optimization algorithm. What will be different is the batch size used for each learning loop iteration. They are available in PyTorch. Take a look at the PyTorch docs for more available optimization algorithms and its hyperparameters:
<a href='https://pytorch.org/docs/stable/optim.html#module-torch.optim'>https://pytorch.org/docs/stable/optim.html#module-torch.optim</a>.

```python
torch.optim.SGD(nn.parameters())
torch.optim.ASGD(nn.parameters())
torch.optim.RMSprop(nn.parameters())
torch.optim.Adam(nn.parameters())
```

<br />


In [None]:
# Learning rate
learning_rate = .05

# SGD optmizer
optimizer = {
    'GD': torch.optim.SGD(nn_torch['GD'].parameters(), lr=learning_rate),
    'SGD': torch.optim.SGD(nn_torch['SGD'].parameters(), lr=learning_rate),
    'MiniSGD': torch.optim.SGD(nn_torch['MiniSGD'].parameters(), lr=learning_rate),
    'ASGD': torch.optim.ASGD(nn_torch['ASGD'].parameters(), lr=learning_rate),
    
    'RMSprop': torch.optim.RMSprop(nn_torch['RMSprop'].parameters(), lr=learning_rate),
    'Adam': torch.optim.Adam(nn_torch['Adam'].parameters(), lr=learning_rate)
}


### 1.8. NN Fit 

In [None]:
num_epochs = 500

history_torch = {'GD': nn_torch['GD'].train(
    loss_fn=loss_fn, 
    optimizer=optimizer['GD'], 
    num_epochs=num_epochs, 
    train_dl=train_dl_GD, 
    train_size=train_size, 
    batch_size=batch_size_GD,
    x_valid=X_val, y_valid=y_val
)}
print('GD train finished\n')

history_torch.update({'SGD': nn_torch['SGD'].train(
    loss_fn=loss_fn, 
    optimizer=optimizer['SGD'], 
    num_epochs=num_epochs, 
    train_dl=train_dl_SGD, 
    train_size=train_size, 
    batch_size=batch_size_SGD,
    x_valid=X_val, y_valid=y_val
)})
print('SGD train finished\n')

for dl in list(optimizer.keys())[2:]:
    history_torch.update({dl: nn_torch[dl].train(
        loss_fn=loss_fn, 
        optimizer=optimizer[dl], 
        num_epochs=num_epochs, 
        train_dl=train_dl_MniSGD, 
        train_size=train_size, 
        batch_size=batch_size_MiniSGD,
        x_valid=X_val, y_valid=y_val
    )})
    print(dl+' train finished\n')


### 1.9. NN history

In [None]:
n_optimizers = len(optimizer)
fig = make_subplots(
    rows=n_optimizers,
    cols=2,
    subplot_titles=[
        f'PyTorch {opt_name} - Loss' if i % 2 == 0 else f'Pytorch {opt_name} - Accuracy'
        for opt_name in optimizer.keys() for i in range(2)
    ],
    vertical_spacing=0.08,
    horizontal_spacing=0.1
)

for row_i, (opt_name, _) in enumerate(optimizer.items(), start=1):
    hist = history_torch[opt_name]

    fig.add_trace(go.Scatter(
        y=hist[0], mode='lines', name='Train Loss',
        line=dict(color='blue'),
        showlegend=(row_i == 1)
    ), row=row_i, col=1)
    fig.add_trace(go.Scatter(
        y=hist[1], mode='lines', name='Validation Loss',
        line=dict(color='orange'),
        showlegend=(row_i == 1)
    ), row=row_i, col=1)

    fig.add_trace(go.Scatter(
        y=[v.cpu().item() for v in hist[2]], mode='lines', name='Train Accuracy',
        line=dict(color='forestgreen'),
        showlegend=(row_i == 1)
    ), row=row_i, col=2)
    fig.add_trace(go.Scatter(
        y=[v.cpu().item() for v in hist[3]], mode='lines', name='Validation Accuracy',
        line=dict(color='orangered'),
        showlegend=(row_i == 1)
    ), row=row_i, col=2)

    fig.update_yaxes(range=[0, 1.01], row=row_i, col=2)

fig.update_layout(
    height=300 * n_optimizers,
    width=800,
    legend=dict(orientation='h', yanchor='bottom', y=-0.05, xanchor='center', x=0.5),
    margin=dict(t=30, b=0)
)

fig.show()


### 1. 10. NN Decision Regions

In [None]:
plt.figure(figsize=(12, 8))
n_rows = 2
n_cols = 3
for nn_i, nn_name in enumerate(nn_torch.keys()):
    plt.subplot(n_rows, n_cols, nn_i + 1)
    plot_decision_regions(X=X_val.cpu().numpy(), 
                      y=y_val.cpu().int().numpy(),
                      clf=nn_torch[nn_name].to('cpu'))
    plt.title(nn_name+' PyTorch NN\nDecision Regions', size=8)
plt.show()


> *Play with different NN architectures, learning rates, momentum, number of iterations (epochs) etc*.

<br />

<br />
<hr />

## 2. Building a Network with TensorFlow

Check the <a href='https://www.tensorflow.org/guide'>TensorFlow documentation</a> for details.

Neural Network in Tensorflow have a modules structure, and it can be summarised in only two steps:

1. Create the NN instance;
2. Define the architecture of the network, with the number of neurons, weights initializers (optional), and activation functions of each layer;
3. Define the loss function. Take a look at the Tensorflow docs for more available loss functions: <a href='https://www.tensorflow.org/api_docs/python/tf/keras/losses'>https://www.tensorflow.org/api_docs/python/tf/keras/losses</a>;
4. Define the optimizer. Take a look at the Tensorflow docs for more available optimizers: <a href='https://www.tensorflow.org/api_docs/python/tf/keras/optimizers'>https://www.tensorflow.org/api_docs/python/tf/keras/optimizers</a>;
5. Compile the model. At this moment, the loss and the optimizer are set, together with other settings, like extra metrics to be evaluated at each epoch.

In step 1, a new model is instanciated with the Class of the type of NN that is needed. To this object, new modules are added to define the NN architecture.

In setp 2, the weights and bias are, by deafult, initialized with the Xavier uniform (called Glorot Uniform in Tensorflow) and the zeros method, repectively. However, the initializers can be chosen from the available methods in Tensorflow framework. Take a look at the Tensorflow docs for more available initialization methods:
<a href='https://www.tensorflow.org/api_docs/python/tf/keras/initializers/'>https://www.tensorflow.org/api_docs/python/tf/keras/initializers/</a>.

<br />

In [None]:
tf.config.set_visible_devices([], 'GPU')
print('Num GPUs Available: ', len(tf.config.experimental.list_physical_devices('GPU')))


### 2.1. NN definitions and compilation

In [None]:
#
# Steps 1 and 2
#
# NN architecture
with tf.device('/cpu:0'):
    m = Sequential()
    m.add(keras.Input(shape=(X_train.shape[1], )))
    # In this layer, the weights and bias initializer are explicitly set
    m.add(Dense(3, 
                activation='relu',
                kernel_initializer=tf.keras.initializers.GlorotUniform(),
                bias_initializer=tf.keras.initializers.Zeros())),
    # In this layer, just as an example, the weights and bias initializer are NOT explicitly set
    m.add(Dense(4, activation='relu')),
    # In this layer, the weights and bias initializer are not set.
    m.add(Dense(1, activation='sigmoid'))

    #
    # Steps 3 to 5
    #
    # Compile models NNs
    nn_tf = {}
    for k in ['GD', 'SGD', 'MiniSGD', 'RMSprop', 'Adam']:
        nn_tf.update({k: tf.keras.models.clone_model(m)})
    
    nn_tf['GD'].compile(optimizer=SGD(learning_rate=learning_rate), loss='binary_crossentropy', metrics=['accuracy'])
    nn_tf['SGD'].compile(optimizer=SGD(learning_rate=learning_rate), loss='binary_crossentropy', metrics=['accuracy'])
    nn_tf['MiniSGD'].compile(optimizer=SGD(learning_rate=learning_rate), loss='binary_crossentropy', metrics=['accuracy'])
    nn_tf['RMSprop'].compile(optimizer=RMSprop(learning_rate=learning_rate), loss='binary_crossentropy', metrics=['accuracy'])
    nn_tf['Adam'].compile(optimizer=Adam(learning_rate=learning_rate), loss='binary_crossentropy', metrics=['accuracy'])


### 2.2. NN fit

In [None]:
# Cheking batch sizes
[batch_size_GD, batch_size_SGD, batch_size_MiniSGD]


In [None]:
history_tf = {}

print('Training GD model... ', end='')
history_tf['GD'] = nn_tf['GD'].fit(
    x=X_train, y=y_train, validation_data=(X_val, y_val), epochs=num_epochs, batch_size=batch_size_GD, verbose=0
)
print('Finished.')

print('Training SGD model... ', end='')
history_tf['SGD'] = nn_tf['SGD'].fit(
    x=X_train, y=y_train, validation_data=(X_val, y_val), epochs=num_epochs, batch_size=batch_size_SGD, verbose=0
)
print('Finished.')

print('Training MiniSGD model... ', end='')
history_tf['MiniSGD'] = nn_tf['MiniSGD'].fit(
    x=X_train, y=y_train, validation_data=(X_val, y_val), epochs=num_epochs, batch_size=batch_size_MiniSGD, verbose=0
)
print('Finished.')

print('Training RMSprop model Training... ', end='')
history_tf['RMSprop'] = nn_tf['RMSprop'].fit(
    x=X_train, y=y_train, validation_data=(X_val, y_val), epochs=num_epochs, batch_size=batch_size_MiniSGD, verbose=0
)
print('Finished.')

print('Training Adam model Training... ', end='')
history_tf['Adam'] = nn_tf['Adam'].fit(
    x=X_train, y=y_train, validation_data=(X_val, y_val), epochs=num_epochs, batch_size=batch_size_MiniSGD, verbose=0
)
print('Finished.')


### 2.3. NN history

In [None]:
n_optimizers = len(history_tf)
fig = make_subplots(
    rows=n_optimizers,
    cols=2,
    subplot_titles=[
        f'Tensorflow {opt_name} - Loss' if i % 2 == 0 else f'Tensorflow {opt_name} - Accuracy'
        for opt_name in history_tf.keys() for i in range(2)
    ],
    vertical_spacing=0.08,
    horizontal_spacing=0.1
)

for row_i, (opt_name, hist) in enumerate(history_tf.items(), start=1):
    
    fig.add_trace(go.Scatter(
        y=hist.history['loss'], mode='lines', name='Train Loss',
        line=dict(color='blue'),
        showlegend=(row_i == 1)
    ), row=row_i, col=1)
    fig.add_trace(go.Scatter(
        y=hist.history['val_loss'], mode='lines', name='Validation Loss',
        line=dict(color='orange'),
        showlegend=(row_i == 1)
    ), row=row_i, col=1)

    fig.add_trace(go.Scatter(
        y=hist.history['accuracy'], mode='lines', name='Train Accuracy',
        line=dict(color='forestgreen'),
        showlegend=(row_i == 1)
    ), row=row_i, col=2)
    fig.add_trace(go.Scatter(
        y=hist.history['val_accuracy'], mode='lines', name='Validation Accuracy',
        line=dict(color='orangered'),
        showlegend=(row_i == 1)
    ), row=row_i, col=2)

    fig.update_yaxes(range=[0, 1.01], row=row_i, col=2)

fig.update_layout(
    height=300 * n_optimizers,
    width=800,
    legend=dict(orientation='h', yanchor='bottom', y=-0.05, xanchor='center', x=0.5),
    margin=dict(t=50, b=50)
)

fig.show()


### 2. 4. NN Decision Regions

In [None]:
plt.figure(figsize=(12, 8))
n_rows = 2
n_cols = 3
for nn_i, nn_name in enumerate(nn_tf.keys()):
    plt.subplot(n_rows, n_cols, nn_i+1)
    plot_decision_regions(X=X_val.cpu().numpy(), 
                      y=y_val.cpu().int().numpy(),
                      clf=nn_tf[nn_name])
    plt.title(nn_name+' Tensorflow NN\nDecision Regions', size=8)
plt.show()


> *Play with different NN architectures, learning rates, momentum, number of iterations (epochs) etc*.

<br />

<hr />

## 3. Excercises (not graded)

- Try different hyperparameters, such as network architecture and learning rates, to improve the NN performance.
- Try different random states to analyse the randomness of the results.
- Fit the NN for a regression dataset.
    
<br />


## 4. Check it out

This website has a very nice tool for testing the effect of the NN hyperparameters on NN results:

<br />

<center>
<h4><a href='https://playground.tensorflow.org/#activation=relu&batchSize=10&dataset=xor&regDataset=reg-gauss&learningRate=0.03&regularizationRate=0&noise=0&networkShape=4,1,2&seed=0.47596&showTestData=false&discretize=false&percTrainData=30&x=true&y=true&xTimesY=false&xSquared=false&ySquared=false&cosX=false&sinX=false&cosY=false&sinY=false&collectStats=false&problem=classification&initZero=false&hideText=false', target='_blank'>Neural Network Playground</a></h4>
</center>

<br />
Enjoy!

<br />
<br />
