# General Deep learning topology design notes:
* usually input layer is not included in layer notation (eg. 2/8 means hidden layer with 2 nodes and output layer with 8)
* 1 hidden layer is enough for [universal approximator](https://en.wikipedia.org/wiki/Universal_approximation_theorem)
    * <sub><sup>¨Specifically, the universal approximation theorem states that a feedforward network with a linear output layer and at least one hidden layer with any “squashing” activation function (such as the logistic sigmoid activation function) can approximate any Borel measurable function from one finite-dimensional space to another with any desired non-zero amount of error, provided that the network is given enough hidden units.¨
    p.198[1]<sub><sup>
* 1 hidden layer can be sufficient but it is ineffective
    * <sub><sup>"Since a single sufficiently large hidden layer is adequate for approximation of most functions, why would anyone ever use more? One reason hangs on the words “sufficiently large”. Although a single hidden layer is optimal for some functions, there are others for which a single-hidden-layer-solution is very inefficient compared to solutions with more layers."[2]<sub><sup>
* depth increses generalization 
    * <sub><sup>"Empirically, greater depth does seem to result in better generalization for a wide variety of tasks. […] This suggests that using deep architectures does indeed express a useful prior over the space of functions the model learns."[3]<sub><sup>
* too few neurons - underfitting, too many - overfitting<sub><sup>[4]<sub><sup>

    
## 1. Number of Neurons and Layers


### 1.1. Number of Neurons
| Number of Hidden Layers | Result |
| :---                    | :----  |
| none                    |Only capable of representing linear separable functions or decisions.
| 1                       | Can approximate arbitrarily with any functions which con-tains a continuous mapping from one finite space to another.|
| 2                       | Represent an arbitrary decision boundary to arbitrary accuracy with rational activation functions and can approximate any smooth mapping to any accuracy.|
<sup><sup>[4]<sup><sup>

| Search strategy | Description |
| :---            | :----       |
| Random          | Try random configurations of layers and nodes per layer. |
| Grid            | Try a systematic search across the number of layers and nodes per layer. |
| Heuristic       | Try a directed search across configurations such as a genetic algorithm or Bayesian optimization. |
| Exhaustive      | Try all combinations of layers and the number of nodes; it might be feasible for small networks and datasets. |
<sup><sup>[5]<sup><sup>

    
### 1.2 Number of Layers
| Rules of thumb for number of neurons in hidden layer |    
| :--- |
| The number of hidden neurons should be between the size of the input layer and the size of the output layer |
| The number of hidden neurons should be 2/3 of the input layer size, plus the size of the output layer |
| The number of hidden neurons should be less than twice the input layer size |
<sub><sup>[4]<sub><sup>


    
## 2. Hyperparameters    
<sup><sup>[8]<sup><sup>
* use [Ray Tune](https://pytorch.org/tutorials/beginner/hyperparameter_tuning_tutorial.html)
    
### Learning rate (LR)

* Perform a learning rate range test to identify a “large” learning rate.
* Using the 1-cycle LR policy with a maximum learning rate determined from an LR range test, set a minimum learning rate as a tenth of the maximum.

### Momentum

* Test with short runs of momentum values 0.99, 0.97, 0.95, and 0.9 to get the best value for momentum.
* If using the 1-cycle learning rate schedule, it is better to use a cyclical momentum (CM) that starts at this maximum momentum value and decreases with increasing learning rate to a value of 0.8 or 0.85.

### Batch Size

* Use as large batch size as possible to fit your memory then you compare performance of different batch sizes.
* Small batch sizes add regularization while large batch sizes add less, so utilize this while balancing the proper amount of regularization.
* It is often better to use a larger batch size so a larger learning rate can be used.

| name | batch size | description |
| :--- | :---       | :---        | 
| Batch Gradient Descent (BGD) | Size of Training Set | converges slowly with accurate estimates of the error gradient |
| Stochastic Gradient Descent (SGD) | 1 | converges fast with noisy estimates of the error gradient | 
| Mini-Batch Gradient Descent | 1 < Batch Size (b) < Size of Training Set | balance between the robustness of SGD and the efficiency of BGD; most common; additional parameter b | 
<sup><sup>[6][8]<sup><sup>
    
### Weight decay

* A grid search to determine the proper magnitude but usually does not require more than one significant figure accuracy.
* A more complex dataset requires less regularization so test smaller weight decay values, such as 10−4 , 10−5 , 10−6 , 0.
* A shallow architecture requires more regularization so test larger weight decay values, such as 10−2 , 10−3 , 10−4 .
    
    
## 3. Prunning 
* evaluating the weighted connections between the layers,  If the network contains any hidden neurons which contain only zero weighted connections, they can be removed.
* connections - determine which connections have the least impact to the effectiveness of the neural network, eg. (i) connections with weight below some threshold, (ii) effectivness of neural net if we remove some connections
* neurons - determine which neurons are surrounded by weak connections
    * possible slight increase/decrease in accuracy, yo uhave evaluate before and after
    * Incremental Pruning - essentially forward trial and error selection, just increase number of neurons, check erorr rate, lower number of neurones with lowest eror rate wins
		- eg "check the current error rate in 1,000 cycle intervals.  If the error does not decrease by a single percentage point, then the search will be abandoned."
    * Selective Pruning - "examining the weight matrixes of a previously trained neural network.  The selec-tive training algorithm will then attempt to remove neurons without disrupting the output of the neural network."
	* [PyTorch pruning tutorial](https://pytorch.org/tutorials/intermediate/pruning_tutorial.html#pruning-a-module)


### 3.1 Trial and error selection method approaches determining the number of hidden neurons
| method   | Description |
| :---     | :----       |
| forward  | start by 2 neurons, train, evaluate and increase the number as long as it improves |
| backward | Start with large number of neurons and remove them until the performance is still acceptable |
<sup><sup>[4]<sup><sup>



## 4. Model evaluation
* (i) split test/train, (ii) k-fold, (iii) fixed random seed<sup><sup>[7]<sup><sup>

## 5. Weight initialization
* prevent layer activation outputs from exploding/vanishing during training
    * loss gradients too large/small to flow backwards -> longer/no convergence of the network
    

<sub><sup>
    [1] Goodfelow, I., Yoshua Bengio, and Aaron Courville. "Deep Learning (Adaptive Computation and Machine Learning Series)." (2016): 800  
    [2] Reed, Russell, and Robert J. MarksII. Neural smithing: supervised learning in feedforward artificial neural networks. Mit Press, 1999.  
    [3] Goodfelow, I., Yoshua Bengio, and Aaron Courville. "Deep Learning (Adaptive Computation and Machine Learning Series)." (2016): 800.  
    [4] Heaton, Jeff. Introduction to neural networks with Java. Heaton Research, Inc., 2008.  
    [5] [How to Configure the Number of Layers and Nodes in a Neural Network](https://machinelearningmastery.com/how-to-configure-the-number-of-layers-and-nodes-in-a-neural-network/)  
    [6] [Difference between a batch and an Epoch](https://machinelearningmastery.com/difference-between-a-batch-and-an-epoch/)  
    [7] [Evaluate skill deep learning models](https://machinelearningmastery.com/evaluate-skill-deep-learning-models/)  
    [8] [Hyper-parameter Tuning Techniques in Deep Learning](https://towardsdatascience.com/hyper-parameter-tuning-techniques-in-deep-learning-4dad592c63c8)  
    [9] [Weight initialization in neural networks a journey from the basics to Kaiming](https://towardsdatascience.com/weight-initialization-in-neural-networks-a-journey-from-the-basics-to-kaiming-954fb9b47c79)  
<sub><sup>

# AutoEncoder in PyTorch - general approach
* [Introduction to Variational AutoEncoders](https://debuggercafe.com/getting-started-with-variational-autoencoder-using-pytorch/)
* [Understanding Variational Autoencoders VAEs](https://towardsdatascience.com/understanding-variational-autoencoders-vaes-f70510919f73)
* https://github.com/AntixK/PyTorch-VAE/blob/master/models/vanilla_vae.py
* https://github.com/AissamDjahnine/Autoencoder-Pytorch/blob/master/Autoencoder-Pytorch.ipynb
* https://github.com/nathanhubens/Autoencoders/blob/master/Variational%20Autoencoders.ipynb
* https://github.com/kvfrans/variational-autoencoder

In [4]:
# import torch
# from torch import nn
# import torch.nn.functional as F
# import numpy as np

In [3]:
# def basic_hidden_layer(input_dim, output_dim):
#     return nn.Sequential(
#         nn.Linear(input_dim, output_dim),
#         nn.ReLU(inplace=True),
#     )

# def basic_output_layer(input_dim, output_dim):
#     return nn.Sequential(
#         nn.Linear(input_dim, output_dim),
#         nn.ReLU(inplace=True),
#     )

# class GenNet(nn.Module):
#     def __init__(self, input_dim=10, hidden_dim1=784, hidden_dim1=128):
#         super(AutoEncoder, self).__init__()
#         # Build the neural network
#         self.layers = nn.Sequential(
#             basic_hidden_layer(z_dim, hidden_dim),
#             basic_hidden_layer(hidden_dim, hidden_dim * 2),
#         )
        
#     def _init_weights()
    
#     def forward(self, input_x):
#         return self.layers(input_x)

In [2]:
from sklearn.preprocessing import MinMaxScaler
import pandas as pd
import numpy as np
import torch
from torch import nn
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader

In [3]:
def binary_sampler(p, rows, cols):
    '''Sample binary random variables.
  
    Args:
    - p: probability of 1
    - rows: the number of rows
    - cols: the number of columns
    
    Returns:
    - binary_random_matrix: generated binary random matrix.
    '''
    unif_random_matrix = np.random.uniform(0., 1., size = [rows, cols])
    binary_random_matrix = 1*(unif_random_matrix < p)
    return binary_random_matrix

def create_missing_data(data_x, probability):
    no, dim = data_x.shape
    mask = binary_sampler(1-probability, no, dim)
    data_x_missing = data_x.copy()
    data_x_missing[mask == 0] = 0
    return data_x_missing, mask
        
def cust_dataloader(data_x_scaled, batch_size, probability=0.1, shuffle=True):
    data_x_missing, mask = create_missing_data(data_x, probability)
    # added .float() as I was getting expected scalar type Float but found Double (numpy stores as Double? https://discuss.pytorch.org/t/pytorch-why-is-float-needed-here-for-runtimeerror-expected-scalar-type-float-but-found-double/98741)
    data_x_missing = torch.Tensor(data_x_missing).float()
    mask = torch.Tensor(mask).float()
    tensor_data_x_mask = TensorDataset(data_x_missing, mask)
    return DataLoader(tensor_data_x_mask, batch_size=batch_size, shuffle=shuffle)

In [15]:
# how to use : 
# NetModel.apply(init_weights)
# the weights are initialized automatically by Kaiming He:
# https://stackoverflow.com/a/56773737/8147433
def init_weights(NetModel):
    if type(NetModel) == nn.Linear:
        # maybe weight.data?, also maybe gain?
        # https://discuss.pytorch.org/t/how-to-fix-define-the-initialization-weights-seed/20156/5
        # as we are using relu....
        torch.nn.init.xavier_normal_(NetModel.weight.data, gain=nn.init.calculate_gain('relu'))
        # in original GAIN post it is 0 but I have also seen 0.01
        NetModel.bias.data.fill_(0)
        #torch.nn.init.xavier_normal_(NetModel.bias.data)

def get_gain_net_block(input_dim, output_dim):
    return nn.Sequential(
        nn.Linear(input_dim, output_dim),
        nn.ReLU(inplace=True),
    )

class Generator(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super(Generator, self).__init__()
        # Build the neural network
        self.net = nn.Sequential(
            get_gain_net_block(input_dim*2, hidden_dim),
            get_gain_net_block(hidden_dim, hidden_dim),
            nn.Linear(hidden_dim, input_dim),
            nn.Sigmoid(),
        )
        
    def forward(self, data_w_noise, mask):
        '''
        Function for completing a forward pass of the GainNet
        '''
        input_data = torch.cat(tensors=[data_w_noise, mask], dim = 1).float()
        return self.net(input_data)
    
class Discriminator(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super(Discriminator, self).__init__()
        # Build the neural network
        self.net = nn.Sequential(
            get_gain_net_block(input_dim*2, hidden_dim),
            get_gain_net_block(hidden_dim, hidden_dim),
            nn.Linear(hidden_dim, input_dim),
            nn.Sigmoid(),
        )
        
    def forward(self, imputed_data, hint_matrix):
        '''
        Function for completing a forward pass of the GainNet
        '''
        input_data = torch.cat(tensors=[imputed_data, hint_matrix], dim = 1).float()
        return self.net(input_data)

# per https://pytorch.org/docs/stable/generated/torch.nn.BCEWithLogitsLoss.html#torch.nn.BCEWithLogitsLoss
disc_criterion = torch.nn.BCELoss(reduction = 'mean')

def discriminator_loss(gen, disc, disc_criterion, mask, data_w_noise, hint_matrix):
    # Generator
    # from Coursera GAN lectures/notebooks:
    # Since the generator is needed when calculating the discriminator's loss, you will need to 
    # call .detach() on the generator result to ensure that only the discriminator is updated!
    # related to: https://stackoverflow.com/a/58699937/8147433
    generator_output = gen(data_w_noise, mask).detach()
    # Combine with original data
    imputed_data = data_w_noise * mask + generator_output * (1-mask)
    # Discriminator
    D_prob = disc(imputed_data, hint_matrix)
    # Loss
    D_loss = disc_criterion(D_prob, mask)
    #D_loss = -torch.mean(mask * torch.log(D_prob + 1e-8) + (1-mask) * torch.log(1. - D_prob + 1e-8))
    return D_loss

gen_criterion = torch.nn.MSELoss(reduction = 'mean')
def generator_loss(gen, disc, gen_criterion, data, mask, data_w_noise, hint_matrix):
    # Generator
    generator_output = gen(data_w_noise, mask)
    # Combine with original data
    imputed_data = data_w_noise * mask + generator_output * (1-mask)
    # Discriminator
    D_prob = disc(imputed_data, hint_matrix)
    # Loss
    G_loss1 = -torch.mean((1-mask) * torch.log(D_prob + 1e-8))
    MSE_train_loss = gen_criterion(mask * generator_output, mask * data_w_noise)

    G_loss = G_loss1 + alpha * MSE_train_loss 
    return G_loss, MSE_train_loss

In [13]:
# set hyper-parameters
gain_parameters = {'batch_size': 100,
                  'hint_rate': 0.9,
                  'alpha': 100,
                  'epochs': 10,
                  'learning_rate': 0.001,
                  'device': 'cpu'}

batch_size = gain_parameters['batch_size']
hint_rate = gain_parameters['hint_rate']
alpha = gain_parameters['alpha']
epochs = gain_parameters['epochs']
learning_rate = gain_parameters['learning_rate']
device = gain_parameters['device']

# initialize your generator, discriminator, and optimizers
# Note: each optimizer only takes the parameters of one particular model, 
# since we want each optimizer to optimize only one of the model  
gen = Generator(52, 52).to(device)
gen.apply(init_weights)
gen_opt = torch.optim.Adam(gen.parameters(), lr=learning_rate)
disc = Discriminator(52, 52).to(device)
disc.apply(init_weights)
disc_opt = torch.optim.Adam(disc.parameters(), lr=learning_rate)

# load and scale the data 
data_x = pd.read_csv('#datasets/Tennessee_Event-Driven/tep_train_extended.csv')
data_x = data_x[data_x['simulationRun']==1].drop(columns=['faultNumber','simulationRun','sample']).values

scaler = MinMaxScaler(feature_range=(0,1))
scaler.fit(data_x)
data_x_scaled = scaler.transform(data_x)
# create missing data and create tensor dataset that includes data and masks
dataset = cust_dataloader(data_x_scaled, batch_size)

In [16]:
for epoch in range(epochs)[:1]:
    for i,(data, mask) in enumerate(dataset):
        # /100 as noise was added in the original paper from uniform distribution <0,0.01>
        noise = (1-mask) * torch.rand(mask.shape)/100
        hint_matrix = mask * binary_sampler(hint_rate, mask.shape[0], mask.shape[1])
        data_w_noise = data + noise
    
        disc_opt.zero_grad()
        D_loss_curr = discriminator_loss(gen, disc, disc_criterion, mask, data_w_noise, hint_matrix)
        print('D_loss_curr: ' + str(D_loss_curr))
        D_loss_curr.backward(retain_graph=True)
        disc_opt.step()
        
        gen_opt.zero_grad()
        G_loss_curr, MSE_train_loss_curr = generator_loss(gen, disc, gen_criterion, data, mask, data_w_noise, hint_matrix)
        print('G_loss_curr: ' + str(G_loss_curr))
        print('MSE_train_loss_curr: ' + str(MSE_train_loss_curr))
        G_loss_curr.backward(retain_graph=True)
        gen_opt.step()

        
        if i % 100 == 0:
            print('Iter: {}'.format(i))
            print('Train_loss: {:.4}'.format(np.sqrt(MSE_train_loss_curr.item())))
            print()

D_loss_curr: tensor(26.9355, grad_fn=<BinaryCrossEntropyBackward>)
G_loss_curr: tensor(99769944., grad_fn=<AddBackward0>)
MSE_train_loss_curr: tensor(997699.4375, grad_fn=<MseLossBackward>)
Iter: 0
Train_loss: 998.8

D_loss_curr: tensor(26.5593, grad_fn=<BinaryCrossEntropyBackward>)
G_loss_curr: tensor(1.0058e+08, grad_fn=<AddBackward0>)
MSE_train_loss_curr: tensor(1005815.4375, grad_fn=<MseLossBackward>)
D_loss_curr: tensor(27.3866, grad_fn=<BinaryCrossEntropyBackward>)
G_loss_curr: tensor(1.0379e+08, grad_fn=<AddBackward0>)
MSE_train_loss_curr: tensor(1037887.1250, grad_fn=<MseLossBackward>)
D_loss_curr: tensor(26.8655, grad_fn=<BinaryCrossEntropyBackward>)
G_loss_curr: tensor(1.0095e+08, grad_fn=<AddBackward0>)
MSE_train_loss_curr: tensor(1009537.3750, grad_fn=<MseLossBackward>)
D_loss_curr: tensor(27.0193, grad_fn=<BinaryCrossEntropyBackward>)
G_loss_curr: tensor(98949416., grad_fn=<AddBackward0>)
MSE_train_loss_curr: tensor(989494.1250, grad_fn=<MseLossBackward>)
D_loss_curr: tens

D_loss_curr: tensor(26.5106, grad_fn=<BinaryCrossEntropyBackward>)
G_loss_curr: tensor(99891304., grad_fn=<AddBackward0>)
MSE_train_loss_curr: tensor(998913.0625, grad_fn=<MseLossBackward>)
D_loss_curr: tensor(27.4231, grad_fn=<BinaryCrossEntropyBackward>)
G_loss_curr: tensor(1.0047e+08, grad_fn=<AddBackward0>)
MSE_train_loss_curr: tensor(1004709.5000, grad_fn=<MseLossBackward>)
D_loss_curr: tensor(26.4830, grad_fn=<BinaryCrossEntropyBackward>)
G_loss_curr: tensor(1.0117e+08, grad_fn=<AddBackward0>)
MSE_train_loss_curr: tensor(1011690.1250, grad_fn=<MseLossBackward>)
D_loss_curr: tensor(26.5794, grad_fn=<BinaryCrossEntropyBackward>)
G_loss_curr: tensor(98376984., grad_fn=<AddBackward0>)
MSE_train_loss_curr: tensor(983769.8750, grad_fn=<MseLossBackward>)
D_loss_curr: tensor(25.9257, grad_fn=<BinaryCrossEntropyBackward>)
G_loss_curr: tensor(1.0018e+08, grad_fn=<AddBackward0>)
MSE_train_loss_curr: tensor(1001750.1250, grad_fn=<MseLossBackward>)
D_loss_curr: tensor(26.4325, grad_fn=<Binary

D_loss_curr: tensor(27.0192, grad_fn=<BinaryCrossEntropyBackward>)
G_loss_curr: tensor(98925352., grad_fn=<AddBackward0>)
MSE_train_loss_curr: tensor(989253.5000, grad_fn=<MseLossBackward>)
D_loss_curr: tensor(27.4423, grad_fn=<BinaryCrossEntropyBackward>)
G_loss_curr: tensor(98832272., grad_fn=<AddBackward0>)
MSE_train_loss_curr: tensor(988322.6875, grad_fn=<MseLossBackward>)
D_loss_curr: tensor(26.5577, grad_fn=<BinaryCrossEntropyBackward>)
G_loss_curr: tensor(99363504., grad_fn=<AddBackward0>)
MSE_train_loss_curr: tensor(993635.0625, grad_fn=<MseLossBackward>)
D_loss_curr: tensor(26.6180, grad_fn=<BinaryCrossEntropyBackward>)
G_loss_curr: tensor(97367632., grad_fn=<AddBackward0>)
MSE_train_loss_curr: tensor(973676.3125, grad_fn=<MseLossBackward>)
D_loss_curr: tensor(26.4615, grad_fn=<BinaryCrossEntropyBackward>)
G_loss_curr: tensor(1.0326e+08, grad_fn=<AddBackward0>)
MSE_train_loss_curr: tensor(1032623.1875, grad_fn=<MseLossBackward>)
D_loss_curr: tensor(26.5017, grad_fn=<BinaryCros