# A Paradigm Shift from Back Propagation to Forward-Forward: An Empirical Investigation

### Importing necessary libraries
In this section, we import the required libraries for our project. The libraries used are `torch`, `numpy`, `pandas`, and `time`. `torch` is a deep learning library, `numpy` is used for numerical computations, `pandas` is used for data manipulation and analysis, and `time` is used for timing the execution of our code.

We also import the `tqdm` library, which is used for creating progress bars. This library is useful when working with large datasets, as it provides a visual representation of the progress of the operations being performed.

The `MNIST` dataset from `torchvision.datasets` is used for our project. The `Compose`, `ToTensor`, `Normalize`, and `Lambda` classes from `torchvision.transforms` are used for preprocessing the data. The `DataLoader` class from `torch.utils.data` is used for loading and batching the data.

We define the `device` variable, which is used to specify whether we will be using the GPU (if available) or the CPU for our computations.


In [1]:
import torch
import numpy as np
import pandas as pd
import time

from tqdm import tqdm 
from torchvision.datasets import MNIST
from torchvision.transforms import Compose, ToTensor, Normalize, Lambda
from torch.utils.data import DataLoader

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')


### Overlay function
The `overlay_y_on_x` function takes a batch of flattened inputs `x` and labels `y`, and returns a copy of `x` in which each row's label is written into the first 10 features as a one-hot pattern.

We first create a copy of the input tensor `x` using the `clone()` method and assign it to a new tensor `x_`, so the original tensor is left unchanged.

Next, we set the first 10 columns of `x_` to zero by multiplying them by `0.0`.

We then use PyTorch's integer indexing to set, in each row `i`, the element in column `y[i]` to the maximum value of `x` (taken over the whole tensor). `y` may be a tensor of per-row labels or a single integer, in which case every row receives the same label.

Finally, we return the modified tensor `x_`.


In [2]:
def overlay_y_on_x(x, y):
    # create a copy of the input tensor x
    x_ = x.clone()
    
    # set the first 10 columns of the copied tensor x_ to zero
    x_[:, :10] *= 0.0
    
    # set the elements at the indices specified by y to the maximum value of x
    x_[range(x.shape[0]), y] = x.max()
    
    # return the modified tensor x_
    return x_
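
As a quick illustration of the overlay, the sketch below applies the same function to a tiny hand-made batch. The 12-feature inputs and the label values are assumptions chosen only to keep the example small; real MNIST rows have 784 features.

```python
import torch

def overlay_y_on_x(x, y):
    # Same logic as the notebook's function, repeated so the example is self-contained
    x_ = x.clone()
    x_[:, :10] *= 0.0
    x_[range(x.shape[0]), y] = x.max()
    return x_

# Illustrative batch: 2 samples with 12 features each
x = torch.tensor([[0.5] * 12, [0.25] * 12])
x[0, 11] = 2.0                      # the largest value in the whole tensor
y = torch.tensor([3, 7])            # one label per row

out = overlay_y_on_x(x, y)

print(out[0, :10])   # zeros except column 3, which holds x.max()
print(out[1, :10])   # zeros except column 7, which holds x.max()
print(out[:, 10:])   # features after the first 10 columns are unchanged
```

Note that the value written is the maximum over the whole batch, not per row, and that the original `x` is not modified.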


### Neural Network Class
The `Net` class extends the `torch.nn.Module` class to define a neural network. The purpose of this class is to encapsulate the functionality of a neural network and make it easy to use.

The constructor of the `Net` class takes an array of integers `dims` as input, which represents the dimensions of the input data and the layers of the network. We start by calling the constructor of the parent class (`torch.nn.Module`) using `super().__init__()`.

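Next, we initialize an empty list `layers` to store the `Layer` objects that make up the network. We then loop over consecutive pairs of entries in `dims`: for each index `d`, we create a `Layer` with input dimension `dims[d]` and output dimension `dims[d + 1]`, move it to the GPU, and add it to the list. Because `layers` is a plain Python list rather than a `torch.nn.ModuleList`, the layers are not registered as submodules of `Net`. Training still works because each `Layer` owns its own optimizer.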

The `predict` method of the `Net` class takes a tensor `x` as input and returns the label with the maximum goodness for each input data. To accomplish this, we first initialize an empty list `goodness_per_label` to store the goodness of each label. Then, we loop through each label and create an overlay of `x` with the current label using the `overlay_y_on_x` function. 

Next, we initialize an empty list `goodness` to store the goodness of each layer for the current label. We then loop through each layer and apply the current layer on the input data. We calculate the mean of the squared elements of the output data, which represents the goodness of the current layer, and add it to the list of `goodness`.

Finally, we add the sum of the goodness of each layer for the current label to the list `goodness_per_label` as a column, and concatenate these columns along dimension 1. The resulting tensor `goodness_per_label` has one row per input sample and 10 columns, where each column holds the goodness for one candidate label. We return the index of the label with the maximum goodness using `argmax(1)`.

The `train` method of the `Net` class takes two tensors `x_pos` and `x_neg` as input, which represent the positive and negative examples, respectively. We start by initializing the positive and negative examples with the input data. Then, we loop through each layer and train the current layer with the positive and negative examples. The training process is handled by the `train` method of the Layer class. We print the current layer number to keep track of the training progress.


In [3]:
class Net(torch.nn.Module):
    # Define the constructor of the Net class
    def __init__(self, dims):
        # Call the constructor of the parent class (torch.nn.Module)
        super().__init__()
        
        # Initialize an empty list to store the Layer objects
        self.layers = []
        
        # Loop through the dimensions of the input data to create the layers
        for d in range(len(dims) - 1):
            # Create a Layer object with input dimension dims[d] and output dimension dims[d + 1]
            # and add it to the list of layers
            self.layers += [Layer(dims[d], dims[d + 1]).cuda()]

    # Define the `predict` method
    def predict(self, x):
        # Initialize an empty list to store the goodness of each label
        goodness_per_label = []
        
        # Loop through each label
        for label in range(10):
            # Create an overlay of x with the current label
            h = overlay_y_on_x(x, label)
            
            # Initialize an empty list to store the goodness of each layer
            goodness = []
            
            # Loop through each layer
            for layer in self.layers:
                # Apply the current layer on the input data
                h = layer(h)
                
                # Calculate the mean of the squared elements of the output data
                goodness += [h.pow(2).mean(1)]
                
            # Add the sum of the goodness of each layer for the current label to the list of goodness per label
            goodness_per_label += [sum(goodness).unsqueeze(1)]
            
        # Concatenate the per-label goodness columns along dimension 1
        goodness_per_label = torch.cat(goodness_per_label, 1)
        
        # Return the index of the label with the maximum goodness
        return goodness_per_label.argmax(1)

    # Define the `train` method
    def train(self, x_pos, x_neg):
        # Initialize the positive and negative examples with the input data
        h_pos, h_neg = x_pos, x_neg
        
        # Loop through each layer
        for i, layer in enumerate(self.layers):
            # Print the current layer number
            print('training layer', i, '...')
            
            # Train the current layer with the positive and negative examples
            h_pos, h_neg = layer.train(h_pos, h_neg)
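
To make the scoring in `predict` concrete, here is a small sketch with hand-picked activations. The values in `acts`, the two candidate labels, and the helper name `goodness` are illustrative assumptions, not outputs of a trained network.

```python
import torch

def goodness(h):
    # Mean of squared activations per sample, as used in predict
    return h.pow(2).mean(1)

# Hypothetical activations of two layers for a batch of 2 samples,
# one list of layer outputs per candidate label
acts = {
    0: [torch.tensor([[1.0, 1.0], [0.0, 0.0]]),
        torch.tensor([[2.0, 0.0], [0.0, 1.0]])],
    1: [torch.tensor([[0.0, 1.0], [3.0, 1.0]]),
        torch.tensor([[1.0, 0.0], [2.0, 2.0]])],
}

per_label = []
for label in (0, 1):
    # Sum the goodness over layers, then store it as one column
    total = sum(goodness(h) for h in acts[label])
    per_label.append(total.unsqueeze(1))

scores = torch.cat(per_label, 1)   # shape: (samples, labels)
pred = scores.argmax(1)            # label with the highest total goodness

print(scores)  # tensor([[3.0, 1.0], [0.5, 9.0]])
print(pred)    # tensor([0, 1])
```

Each column of `scores` corresponds to one candidate label, which is why the concatenation and the `argmax` both operate on dimension 1.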


### Building a Custom Layer
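We now build a custom layer that learns a non-linear feature representation of its input. The `Layer` class inherits from `torch.nn.Linear` and adds two things: a forward pass that normalizes its input, and a `train` method that fits the layer on positive and negative samples.

In the `__init__` method, we initialize the parent `torch.nn.Linear` by calling `super().__init__()`. We also instantiate a `ReLU` activation, create an Adam optimizer over the layer's own parameters (learning rate `0.09`), set the goodness threshold to `9.0`, and set the number of training iterations (`num_epochs`) to `1000`.

The `forward` method divides each input row by its L2 norm (plus `1e-4` to avoid division by zero), so only the direction of the input is passed on, then applies the linear map followed by `ReLU`.

The `train` method repeats the following for `num_epochs` iterations: it computes the goodness (mean squared activation) of the positive and negative samples, evaluates a loss that pushes positive goodness above the threshold and negative goodness below it, and takes one optimizer step. The gradients involve only this layer's parameters. After training, it returns the detached activations for both sample sets, which serve as inputs to the next layer.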

In [4]:
class Layer(torch.nn.Linear):
    def __init__(self, in_features, out_features, bias=True, device=None, dtype=None):
        # Initializing the parent class `torch.nn.Linear`
        super().__init__(in_features, out_features, bias, device, dtype)
        # Instantiating a ReLU activation function
        self.relu = torch.nn.ReLU()
        # Creating an optimizer
        self.opt = torch.optim.Adam(self.parameters(), lr=0.09)
        # Setting a threshold value
        self.threshold = 9.0
        # Number of epochs for training the layer
        self.num_epochs = 1000

    def forward(self, x):
        # Normalizing the input `x`
        x_direction = x / (x.norm(2, 1, keepdim=True) + 1e-4)
        # Calculating the linear activation of `x_direction`
        # and applying ReLU activation on the result

        return self.relu(torch.mm(x_direction, self.weight.T) + self.bias.unsqueeze(0))

    def train(self, x_pos, x_neg):
        # Loop over the number of epochs
        for i in tqdm(range(self.num_epochs)):
            # Calculating the positive samples' goodness values
            g_pos = self.forward(x_pos).pow(2).mean(1)
            # Calculating the negative samples' goodness values
            g_neg = self.forward(x_neg).pow(2).mean(1)
            # Computing the loss that pushes positive samples to values
            # larger than the `self.threshold` and negative samples to
            # values smaller than the `self.threshold`
            loss = torch.log(1 + torch.exp(torch.cat([
                -g_pos + self.threshold,
                g_neg - self.threshold]))).mean()
            # Zeroing the gradients
            self.opt.zero_grad()
            # Compute the derivative of the loss
            loss.backward()
            # Update the layer's parameters
            self.opt.step()
            
        # Return the final activations of positive and negative samples
        return self.forward(x_pos).detach(), self.forward(x_neg).detach()
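
The loss in `Layer.train` has the form `log(1 + exp(z))`, which is the softplus function applied to `threshold - g_pos` and `g_neg - threshold`. The sketch below, using made-up goodness values, suggests that `torch.nn.functional.softplus` gives the same result for moderate inputs while avoiding overflow for large ones. It is offered as a possible alternative formulation, not as what the notebook uses.

```python
import torch
import torch.nn.functional as F

threshold = 9.0                       # same value as Layer.threshold
g_pos = torch.tensor([12.0, 5.0])     # illustrative positive goodness values
g_neg = torch.tensor([1.0, 10.0])     # illustrative negative goodness values

# Argument of the loss: small when positives are above and negatives below the threshold
z = torch.cat([-g_pos + threshold, g_neg - threshold])

naive = torch.log(1 + torch.exp(z))   # formulation used in the notebook
stable = F.softplus(z)                # equivalent built-in

# For a large argument, exp() overflows in float32 and the naive form returns inf
big = torch.tensor([100.0])
naive_big = torch.log(1 + torch.exp(big))
stable_big = F.softplus(big)

print(naive, stable)
print(naive_big, stable_big)
```

With goodness defined as a mean of squared activations, large arguments can occur early in training, which is the situation where the built-in form is safer.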


### Backpropagation Baseline
The `FBNet` class extends `torch.nn.Module` and serves as the backpropagation-trained baseline.

The `__init__` method stores the number of hidden layers (`n_layers`), builds `n_layers` fully connected layers that each map `n_input` features to `n_input` features, and adds an output layer that maps `n_input` to `n_output`. It also creates the cross-entropy loss and an Adam optimizer with learning rate `lr`.

The `forward` method takes an input tensor `x` and applies each fully connected layer in the `layers` list followed by a ReLU activation, then the final output layer. The resulting logits are returned.

The `train` method trains the network on input and target tensors `x` and `y` for a specified number of epochs. Each epoch performs a forward pass, calculates the loss, computes the gradients by backpropagation, and updates the parameters with the optimizer. The method returns the accuracy computed from the last epoch's forward pass.

The `eval` method evaluates the network on input and target tensors `x` and `y`. It performs a forward pass with gradient tracking disabled and returns the fraction of predictions that match the targets.

Note that the inputs and targets are moved to the GPU with a hard-coded `"cuda"` device string rather than the `device` variable defined earlier, so this code requires a CUDA-capable GPU.

In [5]:
class FBNet(torch.nn.Module):
    def __init__(self, n_layers=2, n_input=10, n_output=4, lr=0.001):
        super().__init__()
        # initialize the number of hidden layers in the network
        self.n_layers = n_layers
        # initialize a list of n_layers fully connected linear layers with n_input as input size
        # and n_input as output size
        self.layers = torch.nn.ModuleList([torch.nn.Linear(n_input, n_input) for i in range(n_layers)])
        # initialize the final output layer
        self.out = torch.nn.Linear(n_input, n_output)
        # initialize the loss function as cross entropy loss
        self.criterion = torch.nn.CrossEntropyLoss()
        # initialize the optimizer as Adam with learning rate=lr
        self.optimizer = torch.optim.Adam(self.parameters(), lr=lr)

    def forward(self, x):
        # move the input tensor to GPU
        x = x.to("cuda")
        # loop through all the layers in the list of layers
        for layer in self.layers:
            # apply the layer on the input tensor
            x = layer(x)
            # apply the ReLU activation function on the result
            x = torch.nn.functional.relu(x)
        # apply the final output layer on the result
        x = self.out(x)
        # return the result
        return x
    
    def train(self, x, y, epochs):
        # loop through all the epochs
        for _ in range(epochs):
            # zero the gradients of all parameters
            self.optimizer.zero_grad()
            # apply the forward pass on the input tensor
            output = self.forward(x)
            # move the target tensor to GPU
            y = y.to("cuda")
            # calculate the loss between the output and target tensors
            loss = self.criterion(output, y)
            # backpropagate to compute the gradients
            loss.backward()
            # update the parameters
            self.optimizer.step()
        # calculate accuracy by comparing the predictions with the target tensor
        _, predictions = torch.max(output, 1)
        accuracy = (predictions == y).float().mean()

        return accuracy.item()
    
    def eval(self, x, y):
        # move the input and target tensors to GPU
        x = x.to("cuda")
        y = y.to("cuda")
        # disable gradient calculation
        with torch.no_grad():
            # apply the forward pass on the input tensor
            output = self.forward(x)
            # get the predictions by finding the index of the largest logit in the output tensor
            predictions = torch.argmax(output, dim=1)
            # calculate accuracy by comparing the predictions with the target tensor
            accuracy = (predictions == y).float().mean()
        return accuracy.item()
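
`FBNet` (like `Net`) defines a method named `train` whose signature differs from the inherited `torch.nn.Module.train(mode=True)`. The sketch below uses two hypothetical classes, `Shadowed` and `Renamed`, which are not part of the notebook, to illustrate one consequence: the inherited `eval()` calls `self.train(False)`, so it stops working once `train` is redefined with other arguments. Giving the custom loop a different name, here `fit`, is one way to keep the standard mode switches usable.

```python
import torch

class Shadowed(torch.nn.Module):
    """Hypothetical module whose train() shadows nn.Module.train(mode=True)."""
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(2, 2)

    def train(self, x, y):
        return "custom training loop"

class Renamed(torch.nn.Module):
    """Hypothetical module that keeps the inherited train()/eval() intact."""
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(2, 2)

    def fit(self, x, y):
        return "custom training loop"

# The inherited eval() calls self.train(False), which no longer matches the signature
try:
    Shadowed().eval()
    eval_failed = False
except TypeError:
    eval_failed = True

# With the custom loop under another name, eval() switches the module to eval mode
renamed = Renamed().eval()

print(eval_failed)        # True
print(renamed.training)   # False
```

The notebook's classes work as written because they never rely on the inherited mode switches, but layers such as dropout or batch normalization would depend on them.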


### Loading the MNIST Data
The `MNIST_loaders` function creates and returns data loaders for the MNIST training and test sets, with configurable batch sizes. Each image is converted to a tensor, normalized with mean `0.1307` and standard deviation `0.3081`, and flattened into a 784-element vector. Both datasets are stored under `./data/` and downloaded if not already present. The training loader reshuffles the data on every pass, while the test loader keeps the original order. With the default batch sizes, the whole 10,000-image test set fits into a single batch.

In [6]:
def MNIST_loaders(train_batch_size=50000, test_batch_size=10000):
    # Compose a list of transforms to be applied to the data
    transform = Compose([
        ToTensor(),
        # Normalize the data with mean and standard deviation
        Normalize((0.1307,), (0.3081,)),
        # Flatten the tensor data
        Lambda(lambda x: torch.flatten(x))])

    # Create a DataLoader for the training set
    train_loader = DataLoader(
        # Load the MNIST dataset from the given path, set train=True to use the training set, set download=True to download the dataset if not present
        MNIST('./data/', train=True,
              download=True,
              # Apply the transforms defined above to the data
              transform=transform),
        # Set the batch size for the training set
        batch_size=train_batch_size,
        # Set the shuffle flag to True to shuffle the training data for each epoch
        shuffle=True)

    # Create a DataLoader for the test set
    test_loader = DataLoader(
        # Load the MNIST dataset from the given path, set train=False to use the test set, set download=True to download the dataset if not present
        MNIST('./data/', train=False,
              download=True,
              # Apply the transforms defined above to the data
              transform=transform),
        # Set the batch size for the test set
        batch_size=test_batch_size,
        # Set the shuffle flag to False to keep the test data in order
        shuffle=False)

    # Return the train_loader and test_loader
    return train_loader, test_loader
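
The MNIST training split contains 60,000 images, so the default `train_batch_size=50000` yields two batches per pass: one of 50,000 and one of 10,000. The sketch below reproduces that behaviour at a smaller scale with a synthetic `TensorDataset` (6 samples, batch size 5); the dataset contents are placeholders chosen for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for MNIST: 6 samples with one feature each and integer labels
data = TensorDataset(torch.arange(6).float().unsqueeze(1), torch.arange(6))
loader = DataLoader(data, batch_size=5, shuffle=False)

# Iterating the loader yields a full batch followed by a smaller final batch
batch_sizes = [x.shape[0] for x, _ in loader]

# next(iter(loader)) returns only the first batch
first_x, first_y = next(iter(loader))

print(batch_sizes)        # [5, 1]
print(first_y.tolist())   # [0, 1, 2, 3, 4]
```

Code that takes only `next(iter(train_loader))` therefore sees the first batch and ignores the remainder of the dataset.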


### Creating the FBNet Model
The `get_model` function takes the number of layers, input nodes, and output nodes. The FBNet model is initialized with these parameters and returned.

In [7]:
def get_model(n_layers, n_input, n_output):
    # Initialize the FBNet model with the given number of layers, input nodes, and output nodes
    model = FBNet(n_layers=n_layers, n_input=n_input, n_output=n_output)
    
    # Return the initialized model
    return model

### Training and Evaluating FBNet
The `train_and_get_eval_fb` function trains and evaluates a PyTorch model on the MNIST dataset. The `model` argument is the network to train, and the optional `epochs` argument (default 1) sets the number of passes over the training data.

The function starts by loading the MNIST data into `train_loader` and `test_loader`. The model is then moved to the GPU with `model.to("cuda")`; there is no CPU fallback.

Training runs inside a `torch.autocast("cuda", dtype=torch.float32)` context. Because the requested dtype is already `float32`, this context is not expected to lower the precision of any operation. In each epoch, the function calls `model.train(x, y, 80)` on every batch, which performs 80 optimization steps on that batch, and appends the returned accuracy to `acc_train_list` and the elapsed time to `train_time_list`.

Evaluation runs inside the same kind of context. For each test batch, the accuracy is appended to `acc_test_list` and the elapsed time to `test_time_list`.

Finally, the function returns the mean training accuracy, mean test accuracy, mean training time per batch, and mean evaluation time per batch as `train_acc`, `test_acc`, `train_time`, and `test_time`.

In [22]:
def train_and_get_eval_fb(model, epochs=1):
    # Load the training and testing data
    train_loader, test_loader = MNIST_loaders()
    
    # Move the model to the GPU (requires a CUDA device)
    model = model.to("cuda")
    
    # Use the torch.autocast context to allow automatic casting of certain operations 
    # to float32 on the GPU, if available
    with torch.autocast("cuda", dtype=torch.float32):
        acc_train_list = []
        train_time_list = []
        # Loop over the desired number of epochs
        for _ in range(epochs):
            # Loop over the batches of training data
            for x, y in train_loader:
                # Move the input and label tensors to the GPU
                x, y = x.to("cuda"), y.to("cuda")
                # Train the model on the current batch
                train_1 = time.time()
                acc_train = model.train(x, y, 80)
                train_2 = time.time() - train_1
                acc_train_list.append(acc_train)
                train_time_list.append(train_2)
    
    # Evaluation on the test data
    with torch.autocast("cuda", dtype=torch.float32):
        # Initialize a list to store the accuracy values on the test data
        acc_test_list = []
        test_time_list = []
        # Loop over the batches of test data
        for x, y in test_loader:
            # Move the input and label tensors to the GPU
            x, y = x.to("cuda"), y.to("cuda")
            # Evaluate the model on the current batch
            test_1 = time.time()
            test_acc = model.eval(x, y)
            test_2 = time.time() - test_1
            # Append the accuracy to the list
            acc_test_list.append(test_acc)
            test_time_list.append(test_2)
    
    # Calculate the mean accuracy on the training data
    train_acc = torch.tensor(acc_train_list).mean().item()
    # Calculate the mean accuracy on the test data
    test_acc = torch.tensor(acc_test_list).mean().item()
    train_time = torch.tensor(train_time_list).mean().item()
    test_time = torch.tensor(test_time_list).mean().item()

    # Return the mean accuracy on the training and test data
    return train_acc, test_acc, train_time, test_time
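
When moving batches between devices, keep in mind that `Tensor.to` returns a tensor rather than modifying its argument in place, so the result has to be assigned. The sketch below demonstrates this with a dtype conversion, which behaves the same way and runs without a GPU; the `device` fallback mirrors the variable defined in the first cell.

```python
import torch

x = torch.zeros(2, 3)        # float32 tensor on the CPU

x.to(torch.float64)          # result is discarded, x itself is unchanged
kept = x.to(torch.float64)   # assigning the result keeps the converted tensor

# The same rule applies to device moves
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x_dev = x.to(device)

print(x.dtype)       # torch.float32
print(kept.dtype)    # torch.float64
print(x_dev.device)  # cuda:0 if a GPU is available, otherwise cpu
```

In contrast, `Module.to` moves a module's parameters in place, which is why `model.to("cuda")` works even without using its return value.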


### Training and Evaluating the Forward-Forward Network
The `train_and_get_eval_ff` function trains a Forward-Forward network on MNIST and returns the training accuracy, the test accuracy, and the measured training and test times. The function accepts an optional `epochs` argument, but the current implementation does not use it.

The first step is to load the MNIST data with `MNIST_loaders`, which returns `train_loader` and `test_loader`. Then, a network with two layers of 512 units each, operating on 784-dimensional inputs, is created using the `Net` class.

Only the first batch of the training set is retrieved and moved to the GPU. Two sets of samples are then created from it: positive samples carrying the correct labels (`x_pos`) and negative samples carrying labels drawn from a random permutation of the batch's labels (`x_neg`). Because the permutation can assign a sample its own label again, a fraction of the negative samples may in fact be correctly labeled. The network is trained on these samples using the `train` method.

Finally, the trained model is evaluated on the same training batch and on the first batch of the test set. Each list holds a single value, so the means returned by the function equal those single measurements.

In [23]:
def train_and_get_eval_ff(epochs=1):
    # Load MNIST dataset for training and testing
    train_loader, test_loader = MNIST_loaders()

    # Create a neural network with 2 hidden layers of 512 neurons each
    net = Net([784, 512, 512])
    
    # Get the first batch of data from the training set
    x, y = next(iter(train_loader))
    # Move the data to GPU
    x, y = x.cuda(), y.cuda()
    
    acc_train_list = []
    train_time_list = []

    # Create positive samples with correct labels
    x_pos = overlay_y_on_x(x, y)
    # Create negative samples with random labels
    rnd = torch.randperm(x.size(0))
    x_neg = overlay_y_on_x(x, y[rnd])
    # Train the network on the positive and negative samples
    train_1 = time.perf_counter()
    net.train(x_pos, x_neg)
    train_2 = time.perf_counter() - train_1
    train_time_list.append(train_2)

    # Evaluate the model on the training data and get the mean accuracy
    train_acc = net.predict(x).eq(y).float().mean().item()
    # print('train acc:', train_acc)
    acc_train_list.append(train_acc)


    acc_test_list = []
    test_time_list = []

    # Get the first batch of data from the test set
    x_te, y_te = next(iter(test_loader))
    # Move the data to GPU
    x_te, y_te = x_te.cuda(), y_te.cuda()

    # Evaluate the model on the test data, timing only the prediction step
    test_1 = time.perf_counter()
    test_acc = net.predict(x_te).eq(y_te).float().mean().item()
    test_2 = time.perf_counter() - test_1
    test_time_list.append(test_2)
    acc_test_list.append(test_acc)
    # print('test acc:', test_acc)
    # Calculate the mean accuracy on the training data
    train_acc = torch.tensor(acc_train_list).mean().item()
    # Calculate the mean accuracy on the test data
    test_acc = torch.tensor(acc_test_list).mean().item()
    train_time = torch.tensor(train_time_list).mean().item()
    test_time = torch.tensor(test_time_list).mean().item()

    return train_acc, test_acc, train_time, test_time
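
Negative samples built from a random permutation of the labels can coincide with the true labels. The sketch below, on synthetic labels, measures how often that happens and shows one possible alternative that always produces a wrong label by adding a random offset between 1 and 9 modulo 10. The alternative and the names `offset` and `y_wrong` are assumptions for illustration, not part of the notebook.

```python
import torch

torch.manual_seed(0)

# Synthetic labels standing in for one MNIST batch
y = torch.randint(0, 10, (1000,))

# Approach used in the notebook: permute the labels within the batch
rnd = torch.randperm(y.size(0))
y_perm = y[rnd]
collisions = (y_perm == y).float().mean().item()

# Alternative sketch: shift each label by a random non-zero offset modulo 10
offset = torch.randint(1, 10, y.shape)   # values in 1..9
y_wrong = (y + offset) % 10

print(f"permuted labels equal to the true label: {collisions:.1%}")
print("shifted labels equal to the true label:", (y_wrong == y).sum().item())
```

With ten roughly balanced classes, about one in ten permuted labels is expected to match the true label, whereas the shifted labels never do.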


### Running the Comparison
The final cell compares the two models, the Forward-Forward network (FFNet) and the backpropagation network (FBNet). The `if __name__ == "__main__":` guard ensures that the block runs only when the file is executed as the main program and not when it is imported as a module.

The script first creates lists to store the results of both models: `ff_train_accs` and `ff_test_accs` for FFNet and `fb_train_accs` and `fb_test_accs` for FBNet, plus the corresponding lists of training and test times.

The script then runs a loop 11 times, with `tqdm` showing a progress bar. In each iteration, it sets the number of layers, input size, and output size, creates a fresh FBNet model using `get_model`, and trains and evaluates it using `train_and_get_eval_fb`. The resulting accuracies and times are appended to the FBNet lists. The same process is repeated for FFNet using `train_and_get_eval_ff`.

Finally, the script prints the mean and standard deviation of the training and test accuracy for both models and saves all results to `accs.csv` using pandas. The DataFrame has eight columns: `ff_train_acc`, `ff_test_acc`, `ff_train_time`, `ff_test_time`, `fb_train_acc`, `fb_test_acc`, `fb_train_time`, and `fb_test_time`.

In [25]:
if __name__ == "__main__":
    ff_train_accs, ff_test_accs = [], []  # Lists to store training and testing accuracy of FFNet
    fb_train_accs, fb_test_accs = [], []  # Lists to store training and testing accuracy of FBNet
    fb_train_timestamps, fb_test_timestamps = [], []  # Lists to store training and testing time of FBNet
    ff_train_timestamps, ff_test_timestamps = [], []  # Lists to store training and testing time of FFNet



    # Loop to run the model multiple times and average the results
    for _ in tqdm(range(11)):   # tqdm library is used to show the progress bar for the loop
        n_layers = 2
        n_input = 784
        n_output = 10

        # Get the FBNet model
        model = get_model(n_layers, n_input, n_output)

        # Train and evaluate the FBNet model
        fb_train_acc, fb_test_acc, fb_train_time, fb_test_time = train_and_get_eval_fb(model)
        fb_train_timestamps.append(fb_train_time)
        fb_test_timestamps.append(fb_test_time)
        fb_train_accs.append(fb_train_acc)
        fb_test_accs.append(fb_test_acc)

        # Train and evaluate the FFNet model
        ff_train_acc, ff_test_acc, ff_train_time, ff_test_time = train_and_get_eval_ff()
        ff_train_timestamps.append(ff_train_time)
        ff_test_timestamps.append(ff_test_time)
        ff_train_accs.append(ff_train_acc)
        ff_test_accs.append(ff_test_acc)
        
    # Print the final accuracy results
    print(" ")
    print("FFNet train accuracy: {:.4f} +- {:.4f}".format(np.mean(ff_train_accs), np.std(ff_train_accs)))
    print("FFNet test accuracy: {:.4f} +- {:.4f}".format(np.mean(ff_test_accs), np.std(ff_test_accs)))
    print("FBNet train accuracy: {:.4f} +- {:.4f}".format(np.mean(fb_train_accs), np.std(fb_train_accs)))
    print("FBNet test accuracy: {:.4f} +- {:.4f}".format(np.mean(fb_test_accs), np.std(fb_test_accs)))
    
    # Save the results to a csv file using pandas
    df = pd.DataFrame({"ff_train_acc": ff_train_accs, "ff_test_acc": ff_test_accs, 
                       "ff_train_time": ff_train_timestamps, "ff_test_time": ff_test_timestamps,
                       "fb_train_acc": fb_train_accs, "fb_test_acc": fb_test_accs, 
                       "fb_train_time": fb_train_timestamps, "fb_test_time": fb_test_timestamps})
    df.to_csv("accs.csv")
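
The summary above reports `np.std`, which by default computes the population standard deviation (`ddof=0`). For a small number of repeated runs, the sample standard deviation (`ddof=1`) is often the preferred estimate and is slightly larger. The sketch below compares the two on made-up accuracy values.

```python
import numpy as np

# Illustrative accuracies from three hypothetical runs
accs = np.array([0.90, 0.92, 0.94])

pop_std = np.std(accs)             # divides by n (NumPy's default)
sample_std = np.std(accs, ddof=1)  # divides by n - 1

print("mean: {:.4f}".format(np.mean(accs)))
print("population std: {:.4f}".format(pop_std))
print("sample std:     {:.4f}".format(sample_std))
```

With 11 runs the difference is modest, but stating which convention is used makes the reported spread easier to interpret.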


  0%|          | 0/11 [00:00<?, ?it/s]

training layer 0 ...



  0%|          | 0/1000 [00:00<?, ?it/s][A
  2%|▏         | 16/1000 [00:00<00:07, 137.93it/s][A
  3%|▎         | 30/1000 [00:00<00:36, 26.39it/s] [A
  4%|▎         | 37/1000 [00:01<00:43, 22.00it/s][A
  4%|▍         | 42/1000 [00:01<00:47, 20.23it/s][A
  5%|▍         | 46/1000 [00:02<00:49, 19.20it/s][A
  5%|▍         | 49/1000 [00:02<00:52, 18.22it/s][A
  5%|▌         | 52/1000 [00:02<00:53, 17.57it/s][A
  6%|▌         | 55/1000 [00:02<00:55, 17.18it/s][A
  6%|▌         | 57/1000 [00:02<00:55, 17.11it/s][A
  6%|▌         | 59/1000 [00:02<00:56, 16.79it/s][A
  6%|▌         | 61/1000 [00:02<00:56, 16.56it/s][A
  6%|▋         | 63/1000 [00:03<00:58, 16.05it/s][A
  6%|▋         | 65/1000 [00:03<00:59, 15.70it/s][A
  7%|▋         | 67/1000 [00:03<00:59, 15.61it/s][A
  7%|▋         | 69/1000 [00:03<00:59, 15.61it/s][A
  7%|▋         | 71/1000 [00:03<00:59, 15.53it/s][A
  7%|▋         | 73/1000 [00:03<00:59, 15.69it/s][A
  8%|▊         | 75/1000 [00:03<00:58, 15.89it/s][A

training layer 1 ...



  0%|          | 0/1000 [00:00<?, ?it/s][A
  0%|          | 2/1000 [00:00<00:54, 18.17it/s][A
  0%|          | 4/1000 [00:00<00:58, 17.01it/s][A
  1%|          | 6/1000 [00:00<00:59, 16.63it/s][A
  1%|          | 8/1000 [00:00<01:01, 16.24it/s][A
  1%|          | 10/1000 [00:00<01:02, 15.94it/s][A
  1%|          | 12/1000 [00:00<01:01, 16.16it/s][A
  1%|▏         | 14/1000 [00:00<01:00, 16.20it/s][A
  2%|▏         | 16/1000 [00:00<01:00, 16.31it/s][A
  2%|▏         | 19/1000 [00:01<00:51, 19.05it/s][A
  2%|▏         | 22/1000 [00:01<00:47, 20.66it/s][A
  2%|▎         | 25/1000 [00:01<00:43, 22.18it/s][A
  3%|▎         | 28/1000 [00:01<00:42, 23.07it/s][A
  3%|▎         | 31/1000 [00:01<00:40, 23.65it/s][A
  3%|▎         | 34/1000 [00:01<00:40, 23.99it/s][A
  4%|▎         | 37/1000 [00:01<00:39, 24.22it/s][A
  4%|▍         | 40/1000 [00:01<00:39, 24.35it/s][A
  4%|▍         | 43/1000 [00:02<00:39, 24.46it/s][A
  5%|▍         | 46/1000 [00:02<00:38, 24.59it/s][A
  5%|

training layer 0 ...



  0%|          | 0/1000 [00:00<?, ?it/s][A
  2%|▏         | 16/1000 [00:00<00:07, 137.32it/s][A
  3%|▎         | 30/1000 [00:00<00:36, 26.61it/s] [A
  4%|▎         | 37/1000 [00:01<00:44, 21.72it/s][A
  4%|▍         | 42/1000 [00:01<00:47, 19.97it/s][A
  5%|▍         | 46/1000 [00:02<00:51, 18.71it/s][A
  5%|▍         | 49/1000 [00:02<00:53, 17.75it/s][A
  5%|▌         | 52/1000 [00:02<00:54, 17.24it/s][A
  5%|▌         | 54/1000 [00:02<00:56, 16.88it/s][A
  6%|▌         | 56/1000 [00:02<00:56, 16.56it/s][A
  6%|▌         | 58/1000 [00:02<00:57, 16.44it/s][A
  6%|▌         | 60/1000 [00:02<00:58, 16.03it/s][A
  6%|▌         | 62/1000 [00:03<01:00, 15.58it/s][A
  6%|▋         | 64/1000 [00:03<01:01, 15.30it/s][A
  7%|▋         | 66/1000 [00:03<01:01, 15.15it/s][A
  7%|▋         | 68/1000 [00:03<01:01, 15.15it/s][A
  7%|▋         | 70/1000 [00:03<01:01, 15.19it/s][A
  7%|▋         | 72/1000 [00:03<01:01, 15.17it/s][A
  7%|▋         | 74/1000 [00:03<01:00, 15.19it/s][A

training layer 1 ...



  0%|          | 0/1000 [00:00<?, ?it/s][A
  0%|          | 2/1000 [00:00<00:53, 18.61it/s][A
  0%|          | 4/1000 [00:00<00:58, 17.16it/s][A
  1%|          | 6/1000 [00:00<00:59, 16.57it/s][A
  1%|          | 8/1000 [00:00<01:02, 15.94it/s][A
  1%|          | 10/1000 [00:00<01:01, 16.12it/s][A
  1%|          | 12/1000 [00:00<01:00, 16.22it/s][A
  1%|▏         | 14/1000 [00:00<01:00, 16.30it/s][A
  2%|▏         | 16/1000 [00:00<01:00, 16.33it/s][A
  2%|▏         | 19/1000 [00:01<00:51, 19.05it/s][A
  2%|▏         | 22/1000 [00:01<00:46, 20.92it/s][A
  2%|▎         | 25/1000 [00:01<00:44, 22.15it/s][A
  3%|▎         | 28/1000 [00:01<00:42, 22.97it/s][A
  3%|▎         | 31/1000 [00:01<00:41, 23.54it/s][A
  3%|▎         | 34/1000 [00:01<00:40, 23.96it/s][A
  4%|▎         | 37/1000 [00:01<00:39, 24.27it/s][A
  4%|▍         | 40/1000 [00:01<00:39, 24.39it/s][A
  4%|▍         | 43/1000 [00:02<00:38, 24.55it/s][A
  5%|▍         | 46/1000 [00:02<00:38, 24.66it/s][A
  5%|

training layer 0 ...
training layer 1 ...
training layer 0 ...
training layer 1 ...
training layer 0 ...
training layer 1 ...
training layer 0 ...
training layer 1 ...
training layer 0 ...
training layer 1 ...
training layer 0 ...
training layer 1 ...
training layer 0 ...
training layer 1 ...
training layer 0 ...
training layer 1 ...
training layer 0 ...
training layer 1 ...

FFNet train accuracy: 0.9479 +- 0.0010
FFNet test accuracy: 0.9485 +- 0.0017
FBNet train accuracy: 0.9910 +- 0.0005
FBNet test accuracy: 0.9729 +- 0.0012



